Video action recognition has advanced significantly over the past few decades. Despite the remarkable progress driven by deep learning models and newly released datasets, e.g., Kinetics and Something-Something [10, 27], the higher-order interactions contained in actions are often ignored. Here, "higher-order interactions" refers to interactions involving more than two objects (or parts of an object). To recognize actions containing object interactions, we postulate that two general relations should be taken into consideration: 1) interaction between different objects in a single frame; and 2) transition of the same object (or the same part of an object) across multiple frames. We refer to the former as the spatial relation and the latter as the temporal relation. Both relations stem from long-range dependencies between different objects during interactions and are crucial for recognizing actions involving multiple objects. An effective action recognition model should be able to capture both relations precisely and simultaneously. A classical approach is to utilize a 2D ConvNet to model spatial relations and then apply a recurrent neural network, e.g., long short-term memory (LSTM), to capture temporal relations. However, since the inputs to the LSTM are squeezed into vectors, the rich spatial relation information in the original frames is lost. An alternative is to extend a 2D ConvNet to a 3D ConvNet by expanding the 2D kernel along the temporal dimension. Although 3D convolution is capable of modeling both spatial and temporal relations implicitly, it usually recognizes actions in a clip only after receiving all the frames.
Despite many recent works [2, 16, 40, 28, 36] exploring ways to model higher-order interactions between objects, few of them explicitly build models to capture the object interactions contained in video actions. The non-local network achieved competitive results on various video action recognition tasks recently. It updates each value in a feature map based on a weighted sum of the other values. Though the published non-local formalism does not explicitly contain nodes and edges, it does build a graph over the feature map implicitly, treating each value in the feature map as a graph node. Wang et al.  interpret a video as a space-time graph and adopt graph convolution to model interactions between different objects. However, some limitations remain, e.g., (1) the graph cannot handle streaming videos since it needs to sample frames from an entire video to construct an adjacency matrix, and (2) building a graph for an entire video is computationally expensive. We cannot assume that real-world videos are already segmented before we classify them.
To overcome the above limitations, we propose dynamic graph modules (DGM) that capture object interactions progressively from the beginning of a video to recognize actions. Similar to an LSTM, we maintain a hidden graph across time steps. When a new frame arrives, regions of interest in this frame are regarded as nodes and connected with nodes in the hidden graph. Once edges are constructed, messages from nodes in the newly arriving frame are passed to the hidden graph explicitly. After this message passing process, nodes in the hidden graph exchange information through edges, and the edge weights are updated. A global aggregation function is then applied to the hidden graph to recognize actions at this time step. When the next frame arrives, we repeat the above steps. The dynamic structure of the hidden graph captures the spatial relation contained in each arriving frame. The structure also evolves in the temporal domain, capturing temporal relations across multiple frames. To fully exploit the diverse relations between different objects, we propose two instantiations of our graph module: a visual graph and a location graph. The visual graph is built on the visual similarity of regions of interest to link the same or similar objects and model their relations. The location graph is built on the locations/coordinates of regions of interest; spatially overlapping or nearby objects are connected in the location graph. Our model is capable of recognizing actions from just the first few frames. As more frames arrive, the accuracy of our model increases steadily. Our graph module is plug-and-play: it can be combined with any 2D or 3D backbone ConvNet designed for action recognition and boosts its performance.
To demonstrate the effectiveness of our dynamic graph module in improving the recognition performance of the backbone network, we conduct experiments on a large human-object interaction dataset: Something-Something [10, 27]. This dataset is challenging as most actions in its videos are interactions between objects in both the spatial and temporal domains. We devise a streaming version of our model to test the ability of the graph module to recognize actions in streaming videos; the experimental results demonstrate that our graph module can generate action classes progressively without considering all frames of an entire video. We also provide a static version in which we apply the graph module to boost the recognition accuracy of the backbone ConvNet. The experimental results show that our graph module helps improve the performance of the backbone ConvNet.
Our main contributions include: (i) A dynamic graph module to tackle the video action recognition problem by modeling higher-order interactions between objects in both spatial and temporal domains; (ii) Two instantiations of the graph module, visual graph based on object appearance information and location graph based on object coordinate information, to recognize actions in streaming videos; (iii) State-of-the-art performance on a challenging video action recognition dataset.
The rest of the paper is organized as follows: Sec. 2 summarizes related work in action recognition and graph networks; Sec. 3 describes our dynamic graph module; Sec. 4 describes the full model combining our graph module with a 3D ConvNet; Sec. 5 describes the experimental configurations and evaluates performance; Sec. 6 presents the ablation study of our model; Sec. 7 concludes the paper.
2 Related Work
Video action recognition with deep learning. Convolutional neural network models have achieved great success in solving image-related problems. Inspired by this success, many works have applied convolutional networks to video action recognition [17, 31, 34, 35, 6, 38, 37]. Karpathy et al.  explored various approaches to fusing RGB frames in the temporal domain. Simonyan et al.  devised a two-stream model that fuses RGB features and optical flow features to recognize actions. Tran et al.  applied a 3D kernel to convolve a sequence of frames in the spatio-temporal domain. However, the proposed C3D network is shallow and only capable of processing very short videos. Later work  extended the 3D kernel to deep 2D convolutional networks, e.g., ResNet , and achieved better performance. To overcome the difficulties in training deep 3D ConvNets, inflated 3D convolutional networks (I3D) were proposed, which utilize parameters of 2D ConvNets pre-trained on ImageNet. Wang et al.  added a non-local layer to 3D ConvNets to compute a global weighted sum for each value in the feature maps, further improving the performance of the I3D  network. The work  proposed temporal segment networks (TSN) and claimed that, due to the redundant information in adjacent frames, sparsely sampled frames can also achieve relatively good performance. Most recently, Zhou et al.  showed that the order of frames is crucial for correct recognition, since some videos require reasoning in the temporal domain. Zolfaghari et al.  proposed an online video understanding system which combines a 2D ConvNet and a 3D ConvNet sequentially. An alternative way to model the temporal relation across multiple frames is to use recurrent neural networks to fuse features extracted from a sequence of static RGB frames. Donahue et al.  applied an LSTM to fuse features extracted by a 2D ConvNet. More works can be found in [22, 24, 44, 17, 26]. However, one limitation of recurrent neural networks is that the input at each time step has to be squeezed into a feature vector, in which the rich spatial information conveyed by the original frames is lost.
Graphical models / Graph neural networks. In the recent literature, some approaches combine graphical models with deep neural networks, denoted as graph neural networks (GNN). One of the most intuitive ways is to use features extracted from a CNN to define potential functions and build a graphical model. Some works [9, 30] extend the convolution operation, which operates on Euclidean grid data structures, e.g., image pixel arrays, to non-Euclidean structured data, e.g., graphs, and denote the extended network as a graph convolutional network (GCN). The graph neural network was first introduced in  and then further developed by . Later work added gates to the graph during the message passing process, giving GNNs the capability of handling sequence input. Owing to their inherent ability to model relations among structured data, GNNs have been applied to various fields, including computer vision [33, 32, 4, 8, 42, 43, 28]. However, few works have successfully applied graph neural networks to the video action recognition task. Wang et al.  were the first to represent videos as space-time graphs for video action recognition; in their work , a similarity graph and a spatial-temporal graph are built to capture long-term dependencies among objects across multiple frames. We refer readers to [5, 3, 11] for a comprehensive survey of GNN models and applications.
Similar to , we devise a dynamic graph module based on region proposals in video frames and utilize the graph structure to model relations between interacting objects within and across frames. However, distinct from previous works, our method builds graphs in both the spatial and temporal domains dynamically at each time step. We also add an explicit message-passing process to propagate interactions among objects. Our model can classify actions incrementally, which endows it with the ability to process partially observed videos, e.g., video streams.
3 Dynamic Graph Modules
Definition and Notations. We denote a video as $X = \{x^1, x^2, \dots\}$, where $x^t$ represents the feature map of the $t$-th frame extracted by a 2D or a 3D ConvNet. For each feature map, we keep its top-$K$ region proposals generated by a Region Proposal Network (RPN) , and denote the set of proposals as $P^t = \{p^t_1, \dots, p^t_K\}$, where the superscript $t$ denotes the frame index and the subscript indexes proposals in the current frame. For each proposal, we extract its feature and its coordinates (top-left and bottom-right corners). By analogy with the hidden state in an LSTM, we maintain a hidden graph $G^t$ at each time step $t$ for processing the video, where $G^0$ initializes the hidden graph. We define a hidden graph as $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of nodes and $\mathcal{E}$ denotes the set of weighted edges. Here, we assume that the graph is fully connected and includes a self-connection at each node. Each node in the hidden graph has a feature vector and a pair of (virtual) coordinates (top-left, bottom-right). For simplicity, we use $h_j$ to denote the feature of the $j$-th node in the hidden graph and $c_j$ to denote the coordinates of this node.
Graph Module Overview. In Fig. 2, we provide an unrolled version of our dynamic graph module, omitting the backbone network and RPN for simplicity. During initialization, we average pool all proposals in the first feature map into an initial context vector to warm up the graph module. For each of the following feature maps, proposals are fed into the graph module to update the structure of the hidden graph via an explicit message passing process. We design two types of hidden graphs, the visual graph and the location graph, based on two different dynamic updating strategies, elaborated in Sec. 3.1 and Sec. 3.2. At each time step, the hidden graph contains both the visual features and the interaction information of different regions accumulated over all previous time steps. To classify an action, we apply a global aggregation function to select a group of the most relevant and discriminative regions. An attention mechanism [1, 41] acts as this function, aggregating nodes from the hidden graph to make a prediction at each time step. More details are provided in Sec. 3.3.
Region Proposal Network. A Region Proposal Network (RPN)  is applied to propose bounding boxes of objects in each video frame. Specifically, we adopt an RPN with a ResNet-50 backbone pre-trained on the Microsoft COCO dataset . Note that the bounding boxes proposed by the RPN are class-agnostic. For each frame, we only keep the top-$K$ boxes, sorted by confidence in descending order. Then, we apply RoIAlign  to extract features for the top-$K$ proposals. Finally, we feed the RoIAlign features and proposal coordinates into our graph module.
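As a minimal illustration of this proposal-selection step (a NumPy sketch, not the actual pipeline code; `top_k_proposals` is a hypothetical helper):

```python
import numpy as np

def top_k_proposals(boxes, scores, k=20):
    """Keep the k highest-confidence, class-agnostic proposals.

    boxes:  (N, 4) array of (x1, y1, x2, y2) coordinates.
    scores: (N,) RPN confidence scores.
    """
    order = np.argsort(-scores)[:k]      # indices sorted by descending confidence
    return boxes[order], scores[order]

boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 20], [1, 1, 4, 4]], dtype=float)
scores = np.array([0.2, 0.9, 0.5])
kept_boxes, kept_scores = top_k_proposals(boxes, scores, k=2)
```

In the full model, the kept boxes would then be passed through RoIAlign to obtain one feature vector per proposal.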
3.1 Visual Graph
Our visual graph aims to link objects with similar appearances and is built based on proposal features. The graph building process is illustrated in Fig. 3.
We use the average feature of the top-$K$ proposals at time step $t=1$ to initialize the features of all nodes in the hidden graph. At time step $t$, we measure the pair-wise visual similarity between the proposals in the $t$-th feature map and the nodes in the hidden graph. The visual similarity between the $i$-th proposal in the $t$-th feature map and the $j$-th node in the hidden graph is defined as:
$$s^t_{ij} = \phi(p^t_i)^{\top} \phi'(h^{t-1}_j), \quad (1)$$
where both $\phi$ and $\phi'$ are linear transformations and $\top$ denotes transpose. Inspired by , we apply $\mathrm{softmax}$ to normalize the weights of edges connecting the $j$-th node in the hidden graph and all proposals in the $t$-th feature map, so that we have:
$$\alpha^t_{ij} = \frac{\exp(s^t_{ij})}{\sum_{i=1}^{K} \exp(s^t_{ij})}. \quad (2)$$
Each node in the hidden graph incorporates information from all proposals of the $t$-th feature map gated by $\alpha^t_{ij}$. Therefore, the total amount of inflow information gathered from the $t$-th feature map to node $j$ is:
$$m^t_j = \sum_{i=1}^{K} \alpha^t_{ij}\, p^t_i. \quad (3)$$
An intuitive explanation is that each node in the hidden graph looks for the most visually similar proposals and establishes connections based on the similarity. Subsequently, the node updates its state by absorbing the incoming information:
$$h^t_j = h^{t-1}_j + \mathrm{ReLU}(W\, m^t_j), \quad (4)$$
where $W$ is a learnable transformation. If a proposal and a node are more visually similar in the projected space, more information will flow from this proposal to the node. We apply $\mathrm{ReLU}$ to ensure that only positive information is propagated when updating the node state.
After incorporating the information from all proposals of the $t$-th feature map into all nodes, the hidden graph updates its edges. We assume that the hidden graph is initially fully connected, including self-connections. The weights of all edges are computed as:
$$e^t_{jk} = \psi(h^t_j)^{\top} \psi(h^t_k), \quad (5)$$
where $\psi$ is a linear transformation with learnable parameters. Eq. 5 is similar to Eq. 1, except that both $h^t_j$ and $h^t_k$ are features of nodes in the hidden graph. After the edges of the hidden graph are updated, we propagate information for each node inside the hidden graph using a strategy similar to Eqs. 2, 3, and 4; for Eqs. 2 and 3, we replace $s^t_{ij}$ with $e^t_{jk}$ and replace $p^t_i$ with $h^t_k$. Then we move on to the next time step and repeat the above process. Thanks to this iterative processing, our model is capable of handling continuous streaming videos.
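The visual-graph update described above can be sketched as follows (a NumPy toy with random stand-ins for the learned linear maps; this illustrates the mechanism and is not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 8                                  # K proposals / nodes, feature dim D

# Random stand-ins for the learned linear maps used in Eqs. 1 and 4.
phi, phi_p, W = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def visual_graph_step(h, p):
    """One update of hidden-graph node features h (K, D) from proposal features p (K, D)."""
    s = (p @ phi) @ (h @ phi_p).T            # Eq. 1: similarity s[i, j]
    alpha = softmax(s, axis=0)               # Eq. 2: normalize over proposals i
    m = alpha.T @ p                          # Eq. 3: inflow message for each node j
    return h + np.maximum(0.0, m @ W)        # Eq. 4: ReLU-gated residual update

h = rng.standard_normal((K, D))
p = rng.standard_normal((K, D))
h_new = visual_graph_step(h, p)
```

The intra-graph propagation over the updated edges follows the same three steps, with node features playing the role of the proposals.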
3.2 Location Graph
The visual graph captures visual similarity between nodes in the hidden graph and proposals; however, this is not enough to model spatial and temporal dynamics, e.g., the displacement of objects. To capture spatial relations between each pair of proposals, we propose a location graph, built upon the coordinates of proposals, to link objects that overlap or lie at close positions.
The location-based relation between the $i$-th proposal in the $t$-th feature map and the $j$-th node in the hidden graph at time step $t$ is defined as:
$$r^t_{ij} = \mathrm{IoU}(b^t_i, c^{t-1}_j), \quad (6)$$
where $\mathrm{IoU}(\cdot, \cdot)$ represents the Intersection-over-Union between the $i$-th box in the $t$-th feature map and the $j$-th node in the hidden graph. Similar to , we adopt the L1 norm to normalize the weights connecting the $j$-th node in the hidden graph and all proposals in the $t$-th feature map:
$$\beta^t_{ij} = \frac{r^t_{ij}}{\sum_{i=1}^{K} r^t_{ij}}. \quad (7)$$
By analogy with the message passing process in the visual graph, each node in the hidden graph receives messages from all connected proposals of the $t$-th feature map:
$$m^t_j = \sum_{i=1}^{K} \beta^t_{ij}\, g(p^t_i), \quad (8)$$
where $g$ is a linear transformation, and the node state is updated as in Eq. 4:
$$h^t_j = h^{t-1}_j + \mathrm{ReLU}(m^t_j). \quad (9)$$
After messages are passed from all proposals to the hidden graph, we update the edges of the hidden graph dynamically. We compute the IoU between each pair of nodes inside the hidden graph using Eq. 10, which is similar to Eq. 6:
$$r^t_{jk} = \mathrm{IoU}(c^t_j, c^t_k), \quad (10)$$
where $c^t_j$ and $c^t_k$ are the coordinates of nodes in the hidden graph. Note that unlike the visual graph, the building process of the location graph does not contain any learnable parameters. After the graph is built, messages can be propagated by applying Eqs. 7, 8, and 9, where we replace $r^t_{ij}$ with $r^t_{jk}$ and replace $g$ with another linear transformation $g'$.
Coordinates updating. One issue in building the location graph is how to decide the coordinates (bounding box) of each node in the hidden graph. Since nodes in the hidden graph incorporate information from the proposals in the feature map at each time step, it is unsuitable to adopt a group of fixed coordinates to represent the nodes. To address this problem, we propose a coordinate shifting strategy to approximate the coordinates of each node in the hidden graph.
We use the average coordinates of the top-$K$ proposals at time step $t=1$ to initialize the coordinates of all nodes in the hidden graph. At time step $t$, suppose the top-left and bottom-right coordinates of the $j$-th node in the hidden graph are $c^{t-1}_j$, and the coordinates of the $i$-th proposal in the $t$-th feature map are $b^t_i$. The normalized weight (IoU) between the $j$-th node and the $i$-th proposal is $\beta^t_{ij}$. The larger the weight, the more information flows from the $i$-th proposal to the $j$-th node, and therefore the more the coordinates of the $j$-th node shift towards the position of the $i$-th proposal. The target position of the $j$-th node after message passing is the weighted average of the positions of all proposals in the $t$-th feature map connected to the $j$-th node. Formally, this position is computed as:
$$c^t_j = \sum_{i=1}^{K} \beta^t_{ij}\, b^t_i. \quad (11)$$
Following the shifting strategy, coordinates attached to nodes in the hidden graph will update dynamically according to connected proposals at each time step.
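The IoU-based weights and the coordinate shifting strategy can be sketched together (a NumPy illustration assuming boxes are given as (x1, y1, x2, y2); `location_graph_step` is a hypothetical helper, not the paper's code):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def location_graph_step(node_boxes, prop_boxes):
    """L1-normalized IoU weights and the shifted node coordinates."""
    r = np.array([[iou(p, n) for n in node_boxes] for p in prop_boxes])
    denom = r.sum(axis=0, keepdims=True)
    beta = np.divide(r, denom, out=np.zeros_like(r), where=denom > 0)
    # Weighted-average target position; nodes with no overlap keep old coords.
    new_boxes = np.where(denom.T > 0, beta.T @ prop_boxes, node_boxes)
    return beta, new_boxes

nodes = np.array([[0., 0., 10., 10.]])
props = np.array([[0., 0., 10., 10.], [5., 5., 15., 15.]])
beta, shifted = location_graph_step(nodes, props)
```

Here the node shifts toward the second proposal in proportion to its normalized IoU weight.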
3.3 Attention on Graph
At each time step, the hidden graph contains accumulated information from all preceding time steps, and the recognition decision is generated based on the state of the hidden graph. We therefore need a function to aggregate information from all nodes in the hidden graph. As suggested in , such a function should be invariant to permutations of the nodes, e.g., mean, maximum, or summation. However, one obvious deficiency of these primitive functions is that they cannot assign different importance to different nodes. In video action recognition, each proposal corresponds to a certain area, and not all proposals contribute equally to the recognition result. Hence, these primitive aggregation functions are not appropriate for this task.
The attention mechanism was first proposed in  and takes a weighted average of all candidates based on a query. We add a virtual node to summarize the hidden graph at each time step. The feature of this virtual node serves two purposes: one is to recognize actions at the current time step, and the other is to act as a query (or context) to aggregate information from the hidden graph at the next time step. Specifically, let the feature of the virtual node at time step $t-1$ be denoted as $z^{t-1}$ and the feature of the $j$-th node in the hidden graph at time step $t$ as $h^t_j$; then the feature of the virtual node at time step $t$, denoted as $z^t$, is computed as:
$$a^t_j = \frac{\exp\big((U z^{t-1})^{\top} U' h^t_j\big)}{\sum_{k}\exp\big((U z^{t-1})^{\top} U' h^t_k\big)}, \qquad z^t = \sum_{j} a^t_j\, h^t_j,$$
where $U$ and $U'$ are linear transformations. Note that at the first time step $t=1$, the feature of the virtual node is simply the average feature of all proposals in the first feature map. Once the feature of the virtual node is computed, we conduct action recognition by passing it into a multi-layer perceptron.
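The attention-based aggregation can be sketched as follows (a NumPy toy with random stand-ins for the learned maps U and U'; an illustration, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 5, 8                                  # K hidden-graph nodes, feature dim D
U, U_p = rng.standard_normal((D, D)), rng.standard_normal((D, D))  # stand-ins

def aggregate(z_prev, h):
    """Attention pooling of hidden-graph node features h (K, D),
    using the previous virtual-node feature z_prev (D,) as the query."""
    scores = (h @ U_p) @ (U @ z_prev)        # one compatibility score per node
    a = np.exp(scores - scores.max())
    a /= a.sum()                             # softmax over the K nodes
    return a @ h                             # weighted average = new virtual node

h = rng.standard_normal((K, D))
z0 = h.mean(axis=0)                          # first step: average of proposals
z1 = aggregate(z0, h)                        # virtual node for the next step
```

The resulting vector both feeds the classifier at the current time step and serves as the query at the next one.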
4 Full Model for Action Recognition
In this section, we introduce the full model combining our dynamic graph module with a backbone convolutional network. We introduce two versions of the model, a streaming version and a static version, as illustrated in Fig. 4. The former is suited to handling streaming videos, while the latter achieves better performance.
Streaming Version. Given a video clip (around 5 seconds), our model first randomly samples 32 frames from it. The sampled frames are input into a backbone convolutional neural network; in our case, we apply a 3D ConvNet . The output of the backbone is a sequence of 3D feature maps of shape $T' \times H \times W \times C$, where $T'$ indicates the sequence length in the temporal domain. Meanwhile, we apply a region proposal network (RPN)  to extract proposals for each sampled frame. Next, we conduct RoIAlign  on the sequence of feature maps combined with the boxes proposed by the RPN. We build our graph module dynamically upon the resulting sequence of RoI proposals. We maintain a "hidden graph" which evolves along the temporal dimension and generates a recognition result at each time step except the initialization step. This process is analogous to an LSTM; however, our model retains richer spatial information, as the input to our graph module is a group of RoI features and coordinates instead of a single vector.
Static Version. To achieve better recognition accuracy, it is beneficial to utilize all the information contained in a video. We therefore provide a static version of our model, in which the graph module boosts the performance of the backbone network. We sample frames from the entire video and input all sampled frames into both the backbone 3D ConvNet and an RPN. We average pool the features output by the 3D ConvNet over the spatial and temporal dimensions into a single feature vector. Unlike the streaming version, where we conduct recognition using features generated by the graph module at each time step, here we only take the graph module feature at the last time step. We then concatenate this feature with the pooled feature from the 3D ConvNet, and the concatenated feature is forwarded into a multi-layer perceptron to recognize actions.
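A rough sketch of the static-version head (hypothetical shapes, with a single linear layer standing in for the multi-layer perceptron; the feature width D = 256 is assumed for illustration only):

```python
import numpy as np

D = 256                                      # hypothetical feature width

def static_head(convnet_feats, graph_feat_last, mlp_w):
    """Average-pool the 3D-ConvNet features over time, concatenate with the
    graph-module feature from the last time step, and classify."""
    pooled = convnet_feats.mean(axis=0)              # (T, D) -> (D,)
    fused = np.concatenate([pooled, graph_feat_last])  # (2D,)
    return fused @ mlp_w                             # (2D,) @ (2D, C) -> logits

rng = np.random.default_rng(2)
logits = static_head(rng.standard_normal((8, D)),      # 8 temporal feature maps
                     rng.standard_normal(D),           # last graph-module feature
                     rng.standard_normal((2 * D, 174)))  # 174 action classes
```

Concatenation (rather than, say, summation) lets the classifier weight the backbone and graph features independently.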
5 Experiments
Dataset. We evaluate our dynamic graph module on a large human-object interaction dataset: Something-Something [10, 27]. Distinct from other action recognition datasets, the actions defined in Something-Something involve higher-order interactions among multiple (parts of) objects. The Something-Something-V1  dataset contains more than 100K short videos and Something-Something-V2  contains around 220K videos. The average video duration is about 3 to 6 seconds. There are 174 action classes in total, and each video corresponds to exactly one action. For both the V1 and V2 datasets, we follow the official split to train and test our model.
Since all videos in the Something-Something dataset are single-labeled, we adopt recognition accuracy as our evaluation metric.
Compared methods. To verify that our dynamic graph module is capable of modeling higher-order interactions between objects, we design a baseline model: at each time step, instead of forwarding the top-$K$ proposals to the graph module, we average pool these proposals into a vector and input it to an LSTM. We compare our streaming version model with this baseline, as both can handle streaming videos. We also compare our static version model with recent works [45, 39] to verify that our graph module can improve the performance of an existing 3D ConvNet for action recognition.
5.1 Implementation Details
Training. We first train our backbone 3D model [6, 38] on the Kinetics dataset and then fine-tune it on Something-Something. We randomly sample 32 frames and extract visual features via the backbone 3D ConvNet. Following , input frames are randomly scaled with the shorter side resized to [256, 320]. Then we randomly crop out a square area and randomly flip frames horizontally before passing them to the backbone model. The dropout rate  before the classification layer in the backbone model is set to 0.5. We train the backbone model with a batch size of 24 and an initial learning rate of 0.00125. We apply the stochastic gradient descent (SGD) optimizer with momentum 0.9 and weight decay 0.0001, and adopt the cross-entropy loss during training.
Next, we describe how we train the full model with the dynamic graph module. For each input frame, we generate RoI proposals using an RPN with ResNet-50 pre-trained on Microsoft COCO. We project the proposal coordinates from the input frames back to the feature maps generated by the penultimate convolutional block of the 3D backbone. Since the 32 input frames are reduced to 8 feature maps in the temporal domain, we select 8 input frames (i.e., the 0-th, 4-th, 8-th, …) to match the 8 feature maps. (The 3D ConvNet can be replaced by any other 2D ConvNet, in which case the numbers of feature maps and input frames can be the same.) We apply RoIAlign  with the same configuration as in  to extract features for each proposal. We fix the backbone 3D ConvNet and only train the graph module and the classification layer, adopting the same learning strategy as in the fine-tuning of the backbone.
Inference. We uniformly sample 32 frames and rescale them with the shorter side resized to 256. Then we center crop each frame. All other configurations are kept the same as in training.
5.2 Results of Streaming Version Model
Videos in Something-Something dataset usually contain two to three objects including a person. We keep the top 20 region proposals for each frame. We plot a part of our results in Fig. 5.
The accuracy of the baseline model is significantly lower than that of any of our graph modules, indicating that the combination of average pooling over proposals and an LSTM fails to capture interactions between objects. One possible explanation is that the average pooling operation discards the spatial relations contained in the proposals, and the temporal relation modeled by the LSTM alone is insufficient to capture interactions. By contrast, since our graph module adopts a graphical structure that keeps both spatial and temporal relations among proposals, it can model the complex interactions between objects.
Among the three graph modules, we notice that the joint graph outperforms the visual graph, and the visual graph outperforms the location graph. This is possibly because the visual graph contains more parameters than the location graph, giving it more modeling power. We also note that the joint graph improves the accuracy only slightly compared with the visual graph, which indicates that the interactions captured by the visual graph and the location graph overlap substantially. In other words, the graph module structure intrinsically has the ability to model interactions regardless of the specific instantiation.
The accuracy of all three graph modules increases steadily as the number of frames increases. By the 7-th feature map, the accuracy has already reached a high level compared to the last feature map. This demonstrates that our graph module is able to recognize actions in streaming videos even when only part of the frames has been forwarded into the module.
5.3 Results of Static Version Model
We compare our static version model with some recent works [45, 21, 39], as shown in Table 1. The evaluation is performed on the validation set. On the Something-Something V1 dataset, the backbone 3D ConvNet achieves 45.6% top-1 accuracy and 76.0% top-5 accuracy. Adding any of our three dynamic graph modules to the backbone improves the performance by more than 1%. On the Something-Something V2 dataset, our graph module also boosts the top-1 accuracy of the backbone by around 1%.
Note that, compared with the visual graph or the location graph, the improvement brought by the joint graph is not significant. This suggests that the interactions learned by the two proposed graph modules are similar. We can therefore draw the conclusion that the modeling ability of the graph module is largely independent of any concrete instantiation.
Table 1 (excerpt), top-1/top-5 accuracy (%) on Something-Something V1 and V2:

Method             | V1 top-1 | V1 top-5 | V2 top-1 | V2 top-5
2-Stream TRN       |   42.0   |    -     |   55.5   |   83.1
Space-Time Graphs  |   46.1   |   76.8   |    -     |    -
6 Ablation Study
In this section, we examine several factors that could affect the experimental results, e.g., the number of proposals and the number of nodes in the hidden graph. All ablation studies are conducted on the Something-Something V1 dataset.
6.1 Number of Proposals
We set the RPN to propose different numbers of proposals to analyze how this affects the performance of our model. Due to the sparsity of objects in Something-Something videos, we set the number of proposals to 5, 10, 20, and 50 separately, and report results in Table 6. Both the average accuracy and the last-time-step accuracy increase with the number of proposals, peaking when the number of proposals is 20; the accuracy then drops as the number of proposals increases further. Both the visual graph and the location graph show a similar trend. This implies that we obtain the best performance when the proposals cover the entire video area: fewer proposals lose necessary information contained in the frames, while more proposals introduce redundant information.
6.2 Number of Nodes in Hidden Graph
One advantage of our graph module is that it allows us to configure different numbers of nodes in the hidden graph. We keep the number of RPN proposals fixed at 20 and set the number of nodes in the hidden graph to 10, 20, and 40 separately. The results are reported in Table 6. For both the visual and the location graph, the best performance is obtained only when the number of nodes in the hidden graph equals the number of proposals. Interestingly, fewer graph nodes deteriorate the performance more severely than more graph nodes. One possible explanation is that a graph with fewer nodes compresses the relations conveyed by the proposals and causes information loss, while a graph with more nodes inserts irrelevant relations into the group of proposals and disrupts the original information contained in the videos. Keeping an equal number of proposals and hidden-graph nodes ensures that information flows precisely from the proposals into the hidden graph.
7 Conclusion
We propose a novel dynamic graph module to model higher-order interactions in video activities. We devise two instantiations of the graph module: a visual graph and a location graph. We also provide two versions of the full model to process streaming videos and entire videos, respectively. Experiments show that our model achieves competitive results. In the future, we will extend our graph module to tackle video prediction problems.
-  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
-  P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pages 4502–4510, 2016.
-  P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
-  E. Belilovsky, M. Blaschko, J. R. Kiros, R. Urtasun, and R. Zemel. Joint embeddings of scene graphs and images. ICLR, 2017.
-  M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
-  J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 4724–4733. IEEE, 2017.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
-  V. Garcia and J. Bruna. Few-shot learning with graph neural networks. ICLR, 2018.
-  M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, volume 2, pages 729–734. IEEE, 2005.
-  R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In The IEEE International Conference on Computer Vision (ICCV), volume 1, page 3, 2017.
-  W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. In Computer Vision and Pattern Recognition (CVPR), volume 2, 2018.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
-  W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
-  Y. Kim, C. Denton, L. Hoang, and A. M. Rush. Structured attention networks. arXiv preprint arXiv:1702.00887, 2017.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  M. Lee, S. Lee, S. Son, G. Park, and N. Kwak. Motion feature network: Fixed motion filter for action recognition. In European Conference on Computer Vision, pages 392–408. Springer, 2018.
-  G. Lev, G. Sadeh, B. Klein, and L. Wolf. Rnn fisher vectors for action recognition and image annotation. In European Conference on Computer Vision, pages 833–850. Springer, 2016.
-  Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. ICLR, 2016.
-  Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek. Videolstm convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 166:41–50, 2018.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  C.-Y. Ma, A. Kadav, I. Melvin, Z. Kira, G. AlRegib, and H. P. Graf. Attend and interact: Higher-order object interactions for video understanding. arXiv preprint arXiv:1711.06330, 2017.
-  F. Mahdisoltani, G. Berger, W. Gharbieh, D. Fleet, and R. Memisevic. Fine-grained video classification and captioning. arXiv preprint arXiv:1804.09235, 2018.
-  S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu. Learning human-object interactions by graph parsing neural networks. In European Conference on Computer Vision, pages 407–423. Springer, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
-  L. Song, Z. Wang, M. Yu, Y. Zhang, R. Florian, and D. Gildea. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040, 2018.
-  L. Song, Y. Zhang, Z. Wang, and D. Gildea. N-ary relation extraction using graph state lstm. arXiv preprint arXiv:1808.09101, 2018.
-  D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3d: generic features for video analysis. CoRR, abs/1412.0767, 2(7):8, 2014.
-  D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.
-  L. Wang, W. Li, W. Li, and L. Van Gool. Appearance-and-relation networks for video classification. arXiv preprint arXiv:1711.09125, 2017.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
-  X. Wang, R. B. Girshick, A. Gupta, and K. He. Non-local neural networks. CoRR, abs/1711.07971, 2017.
-  X. Wang and A. Gupta. Videos as space-time region graphs. arXiv preprint arXiv:1806.01810, 2018.
-  N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia, and D. Zoran. Visual interaction networks. arXiv preprint arXiv:1706.01433, 2017.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
-  J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph r-cnn for scene graph generation. In European Conference on Computer Vision, pages 690–706. Springer, 2018.
-  T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In Computer Vision–ECCV 2018, pages 711–727. Springer, 2018.
-  J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.
-  B. Zhou, A. Andonian, and A. Torralba. Temporal relational reasoning in videos. arXiv preprint arXiv:1711.08496, 2017.
-  M. Zolfaghari, K. Singh, and T. Brox. Eco: Efficient convolutional network for online video understanding. arXiv preprint arXiv:1804.09066, 2018.