A Structured Model For Action Detection

12/09/2018 ∙ by Yubo Zhang, et al. ∙ Inria, Carnegie Mellon University

A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand. While this is an obviously attractive approach, it is not applicable in all scenarios. We claim that action detection is one such challenging problem - the models that need to be trained are large, and labeled data is expensive to obtain. To address this limitation, we propose to incorporate domain knowledge into the structure of the model, simplifying optimization. In particular, we augment a standard I3D network with a tracking module to aggregate long term motion patterns, and use a graph convolutional network to reason about interactions between actors and objects. Evaluated on the challenging AVA dataset, the proposed approach improves over the I3D baseline by 5.5% mAP.




1 Introduction

Consider the video sequence from the AVA dataset [14] shown in Figure 1. It shows a person getting up and then receiving a letter from another person, who is seated behind a table. Out of the 2,359,296 pixels in the 36 frames of this clip, what information is actually important for recognizing and localizing this action? Key cues include the location of the actor, his motion, and his interactions with the other actor and the letter. The rest of the video content, such as the color of the walls or the lamp on the table, is irrelevant and should be marginalized over. We use these intuitive observations to design a new method for action detection.

State-of-the-art action detection approaches put a lot of emphasis on actor localization [12, 14, 46], but other cues are largely ignored. For instance, Gkioxari et al. [12] use a state-of-the-art 2D object detection framework [16] to localize all the humans in every frame of the video and then use static appearance features to recognize their activities. Gu et al. [14] extend this approach by replacing the static appearance features with an I3D [3] representation that is capable of capturing short-term motion patterns. This allows them to achieve a significant improvement on the challenging AVA dataset, though their performance on activities with a large temporal extent remains poor. In our method, we aggregate local I3D features over actor tracks, which results in a significant boost in performance.

Figure 1:

For action detection, it is critical to capture both the long-term temporal information and spatial relationships between actors and objects. We propose to incorporate this domain knowledge into the architecture of deep learning models for action detection.

Sun et al. [46] address the problem of modeling human-human and human-object interactions by applying relational networks to explicitly capture interactions between actors and objects in a scene. Their method, however, does not directly model objects, but instead considers every pixel in the frame to be an object proxy. While this approach is indeed generic and object-category agnostic, we argue that the lack of proper object modeling hinders its performance. In work concurrent to [46], Wang et al. [52] use object proposals to localize regions of interest and then employ graph convolutional networks [25] to combine actor and object representations and produce video-level action classification. However, their approach does not address the action detection problem. In our method we also model activities with actor-object graphs, but instead of aggregating features over all the objects and actors in a scene, we propose to structurally model actor-object and actor-actor relations separately during both training and testing. Other works that capture actions with actor-object graphs include [37, 21]. These methods, however, require ground truth annotations of both actors and objects during training and focus on a closed vocabulary of object categories. Our method addresses both of these limitations, first by adopting a weakly-supervised object detection approach that localizes the correct objects at training time without explicit supervision, and second by proposing a simple modification to the state-of-the-art object detection framework [16] which makes it category-agnostic.

In this work we propose a model for action detection in videos that explicitly models long-term human behaviour, as well as human-human and human-object interactions. In particular, our model extracts I3D [3] features for the frames in a video sequence and, in parallel, detects persons and objects with an object detection approach modified from He et al. [16] (Sec 3.1). It then tracks every actor over a 3-second interval, producing a set of tubelets, i.e. sequences of bounding boxes over time [23, 24]. To this end a simple and efficient heuristic tracker is proposed (Sec 3.2.1). The tubelets are then combined with the detected objects to construct an actor-centric graph (Sec 3.2.2). Features from an I3D frame encoding are pooled to obtain a representation for the nodes. Every edge in the graph captures a possible human-human or human-object interaction. A classifier is then trained on the edge features to produce the final predictions. Naively, such an approach would require ground truth object annotations for training. To remove this requirement, we build on intuition from weakly-supervised object detection and learn to automatically integrate useful information from the objects at training time.

To summarize, this work has two main contributions:

  1. We propose a new method for action detection that explicitly captures long-term behaviour as well as human-human and human-object interactions.

  2. We demonstrate state-of-the-art results on the challenging AVA dataset, improving over the best published method by 4.8%, and provide a comprehensive ablative analysis of our approach.

2 Related work

Action classification is one of the fundamental problems in computer vision. Early approaches relied on hand-crafted features [50] that track pixels over time and then aggregated their motion statistics into compact video descriptors. With the onset of deep learning these methods have been outperformed by two-stream networks [44] that take both raw images and optical flow fields as input to CNNs [28], which are trained end-to-end on large datasets. These methods are limited by the 2D nature of CNN representations. This limitation has been addressed by Tran et al. [49] who extended CNN filters to the temporal dimension resulting in 3D convolutional networks. More recently, Carreira and Zisserman [3] have integrated 3D convolutions into a state-of-the-art 2D CNN architecture [47], resulting in Inflated 3D ConvNet (I3D). Wang et al. [51], have extended this architecture with non-local blocks that facilitate fine-grained action recognition. We use an I3D with non-local blocks as the video feature representation in our model.

Action localization can refer to spatial, temporal, or spatio-temporal localization of actions in videos. In this work we study the problem of spatial action localization. Early action detection methods [26, 36] generate hand-crafted features from videos and train an SVM classifier. Early deep-learning based action localization models [13, 34, 41, 45, 53] are developed on top of 2D object detection architectures. They detect actors in every frame and recognize activities using 2D appearance features. Kalogeiton et al. [23] proposed to predict short tubelets instead of boxes, taking several frames as input, but their model is still based on a 2D CNN feature representation and does not aggregate features over the tubelet, which is only used for temporal localization. In Li et al. [29] the authors apply an LSTM [9] on top of the tubelet features to exploit long-term temporal information for action detection. However, their model also relies on a 2D representation and is not trained end-to-end. TCNN [20] uses C3D as a feature representation for action localization, but only extracts features for a single bounding box in the middle of a short sequence of frames. Finally, Gu et al. [14] propose to use I3D as a feature representation, which takes longer video sequences as input, but also do not aggregate features over a tubelet. Our model builds upon the success of I3D for feature extraction. We note that most actions do not remain in the same location across frames. Instead of extracting I3D features of the whole video from a single location, we track actors based on their appearance and aggregate their feature representations along the whole video clip, which enables learning discriminative features for actions with long temporal dependencies.

Figure 2: Overview of our proposed framework. We model both long-term person behaviour and human-human, human-object interactions structurally in a unified framework. The actors across the video are associated to generate actor tubelets for learning long temporal dependency. The features from actor tubelets and object proposals are then used to construct a relation graph to model human-object manipulation and human-human interaction actions. The output of our model are actor-centric actions.

Object detection is a key component of most of the action detection frameworks. Traditional approaches relied on hand-crafted features and part-based models [8]. Modern deep-learning based methods are either based on RCNN-like [11, 10, 39, 16], or SSD-like architectures [31, 38]. In our model, we use Mask-RCNN [16] for person and object detection. One limitation of this approach is that it is trained on a closed vocabulary of 80 object categories in COCO. In the context of human-object interaction recognition, we want to detect any objects that participate in interactions. To this end we propose a simple modification of the training procedure of Mask-RCNN which makes the model category-agnostic.

Object tracking is a well studied problem. Traditional tracking algorithms [1, 22, 17] used hand-crafted appearance features to perform online tracking of the bounding box given in the first frame. Despite their efficiency, the performance of these methods on realistic videos is sub-optimal. State-of-the-art, deep learning-based trackers [32, 19, 56, 7, 48] demonstrate better performance and are more robust. Ma et al. [32] find that the last layers of CNNs encode a semantic abstraction of a target that is invariant to appearance changes. Feichtenhofer et al. [7] use R-FCN [6] and correlation filters for simultaneous region of interest detection and tracking, which is robust to target size changes. Our tracking module follows the tracking-by-detection paradigm and first detects all humans in consecutive video frames. For efficiency reasons, instead of online fine-tuning the model on the detected actors in the first frame, we propose to train a siamese network [2] offline with a triplet loss. We then use the resulting discriminative features for matching the boxes in consecutive frames.

Visual relationship modeling for human-human and human-object pairs increases performance in a variety of tasks including action recognition [52] and image captioning [35, 33]. There have been several works [4, 12, 15] on human-object interaction modeling in images that achieved significant performance improvements on the HICO-DET [5] and V-COCO [30] datasets. Recently, Qi et al. [37] proposed a framework for action localization in videos which represents humans, objects and their interactions with a graphical model. It then uses convolutional LSTMs [54] to model the evolution of the graph over time. Their model, however, uses a 2D CNN for feature representation, requires ground truth annotations of the object boxes for training, and is only evaluated on a toy dataset [27]. Our model does not require object annotations, which allows us to demonstrate results in a more realistic scenario. Similar to our work, Sun et al. [46] propose to implicitly model the interactions between actors and objects without object annotations for training. To this end they use relational networks [42], which avoid explicitly modeling objects by treating each location in an image as an object proxy and aggregating the representation across all locations. In our evaluation we show that explicit modeling of objects and integration of the right object in a frame allow us to learn more discriminative features and achieve state-of-the-art performance on the AVA dataset.

3 Methods

We propose a method for action detection in videos that explicitly models the long-term behaviour of individual people, along with human-human and human-object interactions. The architecture of our model is shown in Figure 2. It takes a sequence of video frames as input (a) and passes them through an I3D network (b). In parallel, a state-of-the-art object detection model [16] (c) is applied to each frame to produce human and object bounding boxes. Human bounding boxes are then combined into tubelets (a sequence of bounding boxes over time) (d) with an association module. The tubelets (as edges) and object boxes (as nodes) are then used to construct an actor-centric graph for every actor in the video clip (e).

In the actor-centric graph, we define two kinds of nodes, the actor node and the object node, along with two kinds of edges, representing human-object manipulation and human-human interaction. The object nodes are generated by performing Region of Interest (ROI) Pooling from the I3D representation. The actor nodes, whose temporal behavior we wish to model, are obtained by aggregating I3D features with graph convolutions over the corresponding tubelets. The features from the graph edges are used as the final representation for action classification. The whole model, except for the 2D object detector, is trained in an end-to-end fashion requiring only actor bounding boxes and ground truth actions. In the rest of this section, we first present our models for video representation and object detection. Then, we explain how we integrate temporal information using an appearance-based multi-object tracking module. Finally, we demonstrate how we build the actor-centric graph and how it is used to generate action predictions.

3.1 Spatio-temporal feature extraction

The first step in our action detection pipeline is to extract two sets of features from videos: an unstructured video embedding, and a collection of object & actor region proposals.

Unstructured video embedding To exploit the spatio-temporal structure of the video input, we use an inflated 3D ConvNet (I3D) with non-local layers [51]. In a 3D ConvNet, videos are modeled as a dense sampling of coordinates, and the corresponding learned filters operate in both spatial and temporal domains, thus capturing short-term motion patterns. We also use non-local layers [51] to aggregate features across the entire image, allowing our network to reason beyond the extent of local convolutional filters. In our scenario, the input is a 3-second video clip with 36 frames. Our final video embedding retains its temporal dimension, enabling us to explicitly use temporal information in the later stages of our model.

Appearance based actor/object proposals We take advantage of the success of RCNN-like [39] models for object detection to identify regions of interest. In our model, we are interested in identifying the spatial location of the actors and of the potential objects being manipulated by them. Since our goal is to understand actions performed by human actors, independent of the categories of the objects being manipulated, we do not explicitly use the object category information from the detection network. More specifically, we train Mask-RCNN [16] on MS-COCO [30] by collapsing all the category labels into a single object label, resulting in a category-agnostic object detector. The properties of this model are analyzed in more detail in a concurrent submission. We use a standard person detector for localizing the actors [16].
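The label-collapsing step described above can be sketched as a simple preprocessing pass over the detector's training annotations. This is a minimal sketch under the assumption that the annotations follow the standard COCO JSON convention; the function name and file paths are hypothetical:

```python
import json

def collapse_categories(coco_json_path, out_path):
    """Collapse all COCO category labels into a single generic
    "object" class, making the trained detector category-agnostic."""
    with open(coco_json_path) as f:
        coco = json.load(f)
    # Replace the 80-way category list with a single class.
    coco["categories"] = [{"id": 1, "name": "object"}]
    for ann in coco["annotations"]:
        ann["category_id"] = 1  # every instance becomes "object"
    with open(out_path, "w") as f:
        json.dump(coco, f)
```

A detector trained on the rewritten annotations then proposes any object-like region, which is what interaction modeling requires.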

3.2 Action detection with temporal context

To enable our action detection system to capture long-term temporal dependencies, we integrate multi-object tracking into our action detection framework. Instead of having explicit action proposals, with an action prediction and an ROI proposal, we propose to simultaneously track each actor across all frames of the video. Then, with the actor appearance information stored in the nodes and the tracking information in the edges, we aggregate each actor's movement using graph convolutions.

3.2.1 Multi-actor association module

We note that some actions are composed of multiple movements; for example, the action 'get up' is composed of sitting, moving upward, and standing. Previous methods that recognize actions from a few frames and link them via an actionness score [45] are not able to maintain consistent tracks since, unlike appearance features, the features of a model trained for action recognition differ significantly across frames due to the actor's movement. We posit that confidently tracking actors across multiple frames and integrating these local representations in a principled way is crucial for learning discriminative representations for actions that are composed of multiple movements.

Motivated by this observation, we introduce a multi-actor association module that aims to associate the bounding box proposals of each actor throughout the video clip. Instead of linking action bounding box proposals based on an actionness score, we associate actor bounding boxes based on the similarity of actor appearance features.

We follow the tracking-by-detection paradigm and build an association module to perform this linking. Specifically, we first train an appearance feature encoding and then explicitly search neighboring regions at the next time step for an appearance match. To learn an appearance feature encoding that distinguishes different actors, we train a Siamese network [18] with a triplet loss [43]. After obtaining the appearance feature encoding, we search among the bounding box proposals in consecutive frames and match the bounding boxes with the highest appearance similarity.
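The two ingredients above, a triplet loss on appearance embeddings and nearest-neighbor matching in embedding space, can be sketched as follows. This is a simplified illustration, not the trained network itself: the embeddings would come from the Siamese encoder, and the margin value here is a hypothetical choice.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on L2 distances between appearance embeddings.
    Anchor and positive are crops of the same actor at different
    times; negative is a crop of a different actor."""
    a, p, n = (np.asarray(x, dtype=float) for x in (anchor, positive, negative))
    return max(0.0, np.linalg.norm(a - p) - np.linalg.norm(a - n) + margin)

def match_boxes(curr_embs, next_embs):
    """Greedily match each current-frame embedding to the closest
    (smallest L2 distance) embedding in the next frame; returns one
    next-frame index per current-frame box."""
    next_embs = np.asarray(next_embs, dtype=float)
    return [int(np.argmin(np.linalg.norm(next_embs - np.asarray(e, dtype=float), axis=1)))
            for e in curr_embs]
```

Chaining `match_boxes` over consecutive frame pairs yields the tubelets used in the rest of the pipeline.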

3.2.2 Actor tubelet learning using graphs

Recent works in action detection attempt to predict actions directly from features extracted by I3D [14]. We claim that integrating I3D features over multiple frames is crucial for recognizing long-term activities. A naive approach would be to simply average these features along the temporal dimension. Instead, we propose to model the behavior of each actor with graph convolutional networks [25]. We obtain the nodes of the person graph from features extracted from the I3D backbone with RoIAlign [16], and obtain the edges through the tubelets from our multi-actor association module. While performing graph convolutions, the movement information of each actor is passed and aggregated through the graph. Formally, assume there are N actors in the video, each represented by a feature of dimension d over T time steps, where T is the temporal dimension. We denote by G the actor tubelet graph with dimension N x N, and by X the actor feature matrix with dimension N x d. The graph convolution operation is

Z = G X W,

where W is the matrix of weights with dimension d x d. The output of the graph is Z, which encodes the actors' features along the temporal axis. Graph convolution operations can also be stacked in multiple layers to learn more discriminative features.
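The graph convolution Z = GXW can be sketched in a few lines. The function name and the ReLU nonlinearity are our assumptions; the paper's exact layer may differ:

```python
import numpy as np

def tubelet_graph_conv(G, X, W):
    """One graph-convolution step over an actor tubelet graph.

    G: (N, N) adjacency matrix linking detections of the same actor
       across frames (the tubelet graph).
    X: (N, d) per-detection features pooled from the I3D backbone.
    W: (d, d) learnable weight matrix.
    Returns relu(G @ X @ W); stacking layers means feeding the
    output back in as the next layer's X.
    """
    return np.maximum(G @ X @ W, 0.0)
```

With G linking each actor's boxes across frames, the multiplication G @ X aggregates each actor's features along its tubelet before the learned transform W is applied.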

3.3 Interactions between actors and objects

To recognize actions associated with interactions, it is critical to exploit the relations between the actor of interest, the other actors, and the objects in the scene. However, modeling all such possible relationships can become intractable. We propose to use class-agnostic features from ROI proposals to build a relation graph and implicitly perform relational reasoning given only action annotations.

To integrate information from the other actors and objects for learning interaction features, we construct two relation graphs: one models human-object manipulation and the other models human-human interaction. The human-object graph connects each actor of interest with the objects, and the human-human graph connects each actor of interest with the other actors. The features of the actor nodes come from the actor tubelets after the multi-actor association module, and we denote them a_1, ..., a_N, where N is the number of actors in the middle frame. The features of the objects are generated by ROI pooling after I3D and are represented as o_1, ..., o_M, where M is the number of objects in the whole video.

To model relationships between a selected actor and other subjects, we can build on the concepts of hard and soft attention models [55]. One way to represent the features of the actions is to first localize the correct subjects among all the objects and all the other actors (excluding the target actor), and then use the features from the actor and the identified subjects; we refer to this as the hard relation graph. Alternatively, in the soft relation graph, instead of explicitly localizing the subjects, we integrate this information by implicitly learning how much each subject relates to the target actor. We now describe how we implement the soft and hard relation graphs to learn discriminative feature representations for interactions.

Hard relation graph We explicitly localize the correct objects and actors for each target actor to represent object manipulation actions and human interaction actions. An object manipulation action is represented by linking an actor node to the object nodes, while a human interaction action is represented through the edges between one actor node and the other actor nodes. Given actor node features a_i and object node features o_j, the object-manipulation relation feature for target actor i and object j is obtained by concatenating the features of the two nodes:

f^o_ij = g_o(a_i, o_j),

where g_o is the feature extraction function for object manipulation. Similarly, we represent the human interaction relation feature for target actor i and actor j with

f^h_ij = g_h(a_i, a_j),

where g_h is the feature extraction function for human interaction.

In order to learn an actor-centric action representation that is independent of the subjects, for each actor we select the maximum response as the final prediction. Specifically, for the object manipulation action centered at actor i,

p^o_i = sigma(max_j f^o_ij),

where sigma is the sigmoid function and p^o_i is the human-object manipulation action prediction for actor i. Similarly, for human interaction actions,

p^h_i = sigma(max_j f^h_ij),

where p^h_i is the human-human interaction action prediction for actor i.
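The max-response selection can be sketched as follows. The linear scorer `w` here is a hypothetical stand-in for the learned relation classifier, and the function name is ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_relation_score(actor, objects, w):
    """Max-pooled pairwise score for one target actor.

    actor:   (d,) tubelet feature of the target actor.
    objects: (M, d) features of candidate subjects (objects or
             other actors).
    w:       (2d,) weights of a linear scorer standing in for the
             learned relation classifier (hypothetical).
    Concatenates the actor feature with every candidate, scores each
    pair, and keeps the maximum response through a sigmoid.
    """
    actor = np.asarray(actor, dtype=float)
    objects = np.asarray(objects, dtype=float)
    pairs = np.concatenate(
        [np.tile(actor, (len(objects), 1)), objects], axis=1)  # (M, 2d)
    return sigmoid(np.max(pairs @ np.asarray(w, dtype=float)))
```

Taking the maximum over candidates makes the prediction actor-centric: only the most responsive subject determines the action score, which is what allows training without ground truth object annotations.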

Soft relation graph Instead of explicitly localizing the correct subjects for each action, in the soft relation module the edges represent how much the other actors and objects should contribute to the interaction relations of the target actor. We define the strength of a relation as the inverse of the Euclidean distance between the two nodes' features after a feature transformation.

We denote the feature transformations for actor features and object features by phi_a and phi_o. Given actor node features a_i and object node features o_j, we first transform the actor and object features, obtaining phi_a(a_i) and phi_o(o_j). The edge between actor i and object j is represented by

e^o_ij = 1 / ||phi_a(a_i) - phi_o(o_j)||.

Similarly, the edge between actor i and actor j is represented by

e^h_ij = 1 / ||phi_a(a_i) - phi_a(a_j)||.

We further normalize the edges so that, for each actor, the human interaction edges and the object manipulation edges each sum to one, by applying a softmax function:

w^o_ij = exp(e^o_ij) / sum_k exp(e^o_ik),

and analogously for w^h_ij, where the sum for object edges runs over all M objects, and the sum for actor edges runs over 1...N except i.

After obtaining the graph representation, the object manipulation and human interaction features for actor i are the weighted sums

f^o_i = sum_j w^o_ij o_j and f^h_i = sum_j w^h_ij a_j,

with j ranging as above. Afterwards, these feature representations are passed through a sigmoid activation function to create a logistic classifier that predicts object manipulation actions and human interaction actions.
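The soft aggregation step can be sketched as follows, assuming the inputs have already been passed through the feature transformations phi_a and phi_o; the function name and the softmax stabilization are our choices:

```python
import numpy as np

def soft_relation_feature(actor, objects, eps=1e-6):
    """Soft-attention aggregation for one target actor.

    actor:   (d,) transformed feature of the target actor.
    objects: (M, d) transformed features of candidate subjects.
    Edge strength is the inverse Euclidean distance to each
    candidate; a (numerically stabilized) softmax turns strengths
    into weights, which give a weighted sum of candidate features.
    Returns the weights and the aggregated feature.
    """
    actor = np.asarray(actor, dtype=float)
    objects = np.asarray(objects, dtype=float)
    strengths = 1.0 / (np.linalg.norm(objects - actor, axis=1) + eps)
    s = strengths - strengths.max()        # stabilize the softmax
    weights = np.exp(s) / np.exp(s).sum()
    return weights, weights @ objects
```

Closer candidates in feature space receive larger weights, so the aggregated feature is dominated by the subjects most related to the target actor while remaining differentiable end-to-end.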

4 Experiments

In this section, we first introduce the dataset and the metrics that our model is evaluated on. Next, we describe implementation details of each module in our model. Furthermore, we perform evaluation and ablation analysis, demonstrating the effectiveness of our model on integrating temporal and spatial context information. Finally we compare our model with state-of-the-art methods both quantitatively and qualitatively.

Figure 3: Performance of our proposed model and baseline

4.1 Dataset and metric

We develop our models on the AVA version 2.1 benchmark [14], where action localization and recognition are performed on the middle frame of three-second videos. The dataset contains 211k training samples and 57k validation samples. There are 80 categories for training, and testing uses the 60 categories with no fewer than 25 validation samples. We report frame-based mean average precision (mAP) with an intersection-over-union (IoU) threshold of 0.5.

4.2 Implementation details

Our model is implemented in the Caffe2 framework. We follow the schema proposed in [3, 51] to pre-train our video backbone model. We use the same 2D architecture as ResNet-50 and pretrain it on the ImageNet dataset [40]. The model is then inflated into a 3D ConvNet as proposed in [3] (I3D) and pretrained on the Kinetics dataset [3]. We augment our backbone model with non-local operations [51] after the Res2, Res3, and Res4 blocks, and further fine-tune it end-to-end with our proposed spatio-temporal context model. The input to our video backbone is 36 frames from a 3-second video at 12 fps. The images are first scaled to 272x272 and randomly cropped to 256x256.

For the region proposal model, we use Mask-RCNN [16] with a ResNet-50 backbone. Instead of predicting a category for each region of interest proposal, we predict only two categories: person, and a single category covering any potentially interacted object. The region proposal model is pretrained on the COCO dataset [30] and further fine-tuned on the AVA dataset. We use a threshold of 0.5 for object bounding boxes and 0.9 for person bounding boxes.

We train our model on an 8-GPU machine where each GPU processes a mini-batch of 3 video clips, for a total batch size of 24. We freeze the parameters of the batch normalization layers during training and apply dropout with rate 0.3 before the final layer. We train for 90K iterations with a learning rate of 0.00125, followed by 10K iterations with a learning rate of 0.000125.

For the tracking module, we use a ResNet-50 architecture for appearance feature encoding and a triplet loss [43] to learn representative appearance features for tracking actors in the video. The model takes three images as input, where two of them are cropped images of the same actor at different times (ranging from 0.02s to 10s) and the third is the cropped area of a different actor sampled from the same period. The feature output dimension is 128, and we use the L2 distance as the similarity metric. The model is fine-tuned from ImageNet pretrained weights for 100K iterations with a batch size of 64. While tracking, for each actor we search the region of interest proposals in consecutive frames with an IoU overlap larger than 0.5, and link the bounding box which minimizes the appearance distance.
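The IoU-gated linking rule at the end of the previous paragraph can be sketched as follows; the function names are hypothetical and boxes are assumed to be in [x1, y1, x2, y2] format:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def link_actor(box, emb, next_boxes, next_embs, iou_thresh=0.5):
    """Keep next-frame proposals overlapping the current box by more
    than iou_thresh, then pick the one whose appearance embedding is
    closest in L2 distance. Returns the index, or None if no
    candidate passes the IoU gate."""
    cands = [i for i, b in enumerate(next_boxes) if iou(box, b) > iou_thresh]
    if not cands:
        return None
    dists = [np.linalg.norm(np.asarray(next_embs[i], dtype=float)
                            - np.asarray(emb, dtype=float)) for i in cands]
    return cands[int(np.argmin(dists))]
```

The IoU gate keeps the search local and cheap, while the appearance distance disambiguates between spatially overlapping actors.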

4.3 Evaluation and ablation analysis

Model mAP
Baseline 16.7
Person similarity graph on ROIs 20.1
Object similarity graph on ROIs 20.3
Actor tubelets model 21.1
Actor tubelets + hard relation graph module 21.5
Actor tubelets + soft relation graph module 22.2
Table 1: Ablation evaluation of each component of our model on the AVA validation set.
Model Human pose Object manipulation Human interaction
Baseline 35.7 8.9 16.9
Person similarity graph on ROIs 39.1 12.1 20.1
Object similarity graph on ROIs 39.3 13.0 20.0
Actor tubelets model 40.6 13.4 20.9
Actor tubelets + hard relation graph module 41.0 13.2 22.2
Actor tubelets + soft relation graph module 41.9 14.3 22.0
Table 2: Ablation evaluation on human pose, human-object manipulation and human-human interaction categories separately.

We first analyze each component of our framework to understand how our proposed model helps with the action detection task. We then present quantitative and qualitative results, comparing our model to a baseline and verifying its ability to detect actions using spatio-temporal context. The results are shown in Table 1.

All our models are developed on top of the non-local augmented I3D backbone. The baseline averages over the temporal dimension after I3D and uses actor bounding box proposals to extract a feature for each actor and recognize their actions. It achieves an mAP of 16.7 on the validation set, a slight improvement over the baseline established in [46].

Figure 4: We visualize the performance of our proposed model and the baseline. In the first column, we show the video clip by sampling several frames. The second column is the middle frame where actor localization is performed. Ground-truth bounding boxes and our bounding boxes prediction are marked with green and red color. The third column shows the ground-truth action and prediction results of both our method and the baseline.

We first analyze the performance of the actor tubelet component. Wang et al. [52] propose to use a similarity graph and a spatio-temporal graph to integrate information spatially and temporally for action recognition. We adapt their work to the domain of action detection, where actor proposals occur across frames and the similarity graph integrates information over frames. We observe that with the actor similarity graph, the model achieves an mAP of 20.1 on the validation set. By explicitly connecting the same actor across frames and applying graph convolutions on top, we further increase performance to an mAP of 21.1. In addition to the score averaged over all 60 test classes, we also show performance on three action groups in Table 2: human pose, human-object manipulation and human-human interaction. We observe that our model largely outperforms the person graph model and the baseline on the human pose and object manipulation categories. Notably, our actor tubelet model achieves a 2x performance increase on not-in-place actions like fall down, get up, and jump. There is also a large performance gain on human manipulation actions like hand clap, eat/drink, and play instrument, which further verifies the effectiveness of aggregating information per actor for action detection.

We then evaluate our proposed model on learning actions involving interactions. As a simple baseline, we first experiment with a similarity graph model built over the objects of interest across the whole video clip. The objects of interest are category-agnostic and are supposed to include both humans and other objects, providing information for interaction modeling. This achieves 20.3 mAP on the validation set which, although a large increase over the baseline, is 0.8 mAP below our actor tubelet model. Therefore, naively building a graph over objects of interest does not help much. We then evaluate the actor tubelet model with the hard relation graph and with the soft relation graph, which achieve mAPs of 21.5 and 22.2 respectively, as shown in Table 1. We further observe that on human pose, object manipulation and human interaction actions, our actor tubelets with the soft relation graph module improve by 6.2, 5.4 and 5.1 mAP over the baseline respectively, which demonstrates the effectiveness of our proposed model at capturing both temporal dependencies and interactions for action detection.

4.4 Comparison with state-of-the-art

Model                     mAP
Single Frame model [14]   14.2
ACRN [46]                 17.4
Our model                 22.2
Table 3: Comparison of our model with state-of-the-art methods on the AVA validation set.

We compare our best model with state-of-the-art models on the AVA dataset in Table 3. Our model shows a clear mAP increase over the existing state-of-the-art model of [46]. We attribute this superior performance to our structured framework, which explicitly models temporal dependencies on a per-actor basis and learns human-human interaction as well as human-object manipulation with a relation graph. In contrast, ACRN [46] models relations by treating every pixel in the frame as a proxy, which is noisier. We also compare per-class mAP of our actor tubelet with soft relation graph model against the baseline in Figure 3. The largest gains are on the categories drive, play musical instrument, and hand clap, which are actions that require learning long-term temporal dependencies and relations with objects.

4.5 Qualitative Results Analysis

To qualitatively evaluate our model, we verify its ability to capture temporal information and contextual relations. We visualize video clips and compare performance on several challenging examples in Figure 4. We select three challenging examples in which actors perform actions with nontrivial temporal behavior and challenging object relationships. Our model shows stable performance across these sequences, verifying that it can handle actions that require both temporal context and complex interaction with other persons and objects. We attribute this to the explicit temporal reasoning with tubelets and the relation graph.

5 Conclusions

We proposed a structured model for action detection that explicitly models long-term temporal behavior as well as human-object manipulation and human-human interaction. Our model demonstrates a large performance gain over the current state of the art, which shows the effectiveness of the method at modeling temporal dependencies and reasoning about interactions. More broadly, its success demonstrates the importance of integrating temporal and relational information for action detection.


  • [1] S. Baker and I. Matthews. Lucas-kanade 20 years on: A unifying framework. International journal of computer vision, 56(3):221–255, 2004.
  • [2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In ECCV, 2016.
  • [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • [4] Y.-W. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng. Learning to detect human-object interactions. In WACV, 2018.
  • [5] Y.-W. Chao, Z. Wang, Y. He, J. Wang, and J. Deng. Hico: A benchmark for recognizing human-object interactions in images. In ICCV, 2015.
  • [6] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, 2016.
  • [7] C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track and track to detect. In CVPR, 2017.
  • [8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
  • [9] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with lstm. 1999.
  • [10] R. Girshick. Fast r-cnn. In ICCV, 2015.
  • [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [12] G. Gkioxari, R. Girshick, P. Dollár, and K. He. Detecting and recognizing human-object interactions. CVPR, 2018.
  • [13] G. Gkioxari and J. Malik. Finding action tubes. In CVPR, 2015.
  • [14] C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D. A. Ross, G. Toderici, Y. Li, S. Ricco, R. Sukthankar, C. Schmid, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. CVPR, 2018.
  • [15] S. Gupta and J. Malik. Visual semantic role labeling. CVPR, 2016.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
  • [17] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
  • [18] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. CVPR, 2017.
  • [19] S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. In ICML, 2015.
  • [20] R. Hou, C. Chen, and M. Shah. Tube convolutional neural network (t-cnn) for action detection in videos. In ICCV, 2017.
  • [21] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori. A hierarchical deep temporal model for group activity recognition. In CVPR, 2016.
  • [22] Z. Kalal, K. Mikolajczyk, J. Matas, et al. Tracking-learning-detection. IEEE transactions on pattern analysis and machine intelligence, 34(7):1409, 2012.
  • [23] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. Action tubelet detector for spatio-temporal action localization. In ICCV, 2017.
  • [24] K. Kang, W. Ouyang, H. Li, and X. Wang. Object detection from video tubelets with convolutional neural networks. In CVPR, 2016.
  • [25] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [26] A. Kläser, M. Marszałek, C. Schmid, and A. Zisserman. Human focused action localization in video. In ECCV, 2010.
  • [27] H. S. Koppula and A. Saxena. Anticipating human activities using object affordances for reactive robotic response. IEEE transactions on pattern analysis and machine intelligence, 38(1):14–29, 2016.
  • [28] Y. LeCun, Y. Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
  • [29] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei. Recurrent tubelet proposal and recognition networks for action detection. In ECCV, 2018.
  • [30] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
  • [32] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In ICCV, 2015.
  • [33] C.-Y. Ma, A. Kadav, I. Melvin, Z. Kira, G. AlRegib, and H. P. Graf. Attend and interact: Higher-order object interactions for video understanding. CVPR, 2018.
  • [34] X. Peng and C. Schmid. Multi-region two-stream r-cnn for action detection. In ECCV, 2016.
  • [35] J. Peyre, I. Laptev, C. Schmid, and J. Sivic. Weakly-supervised learning of visual relations. In ICCV, 2017.
  • [36] A. Prest, V. Ferrari, and C. Schmid. Explicit modeling of human-object interactions in realistic videos. IEEE transactions on pattern analysis and machine intelligence, 35(4):835–848, 2013.
  • [37] S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu. Learning human-object interactions by graph parsing neural networks. ECCV, 2018.
  • [38] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [39] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [40] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [41] S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529, 2016.
  • [42] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.
  • [43] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
  • [44] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
  • [45] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. Online real-time multiple spatiotemporal action localisation and prediction. In ICCV, 2017.
  • [46] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid. Actor-centric relation network. ECCV, 2018.
  • [47] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [48] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In CVPR, 2016.
  • [49] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • [50] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International journal of computer vision, 103(1):60–79, 2013.
  • [51] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
  • [52] X. Wang and A. Gupta. Videos as space-time region graphs. ECCV, 2018.
  • [53] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. In ICCV, 2015.
  • [54] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
  • [55] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • [56] G. Zhu, F. Porikli, and H. Li. Robust visual tracking with deep convolutional neural network based object proposals on pets. In CVPR Workshop, 2016.