Video understanding tasks, such as action recognition, have, for the most part, treated actions and activities as monolithic events [37, 8, 86, 65]. Most proposed models resort to end-to-end predictions that produce a single label for a long video sequence [68, 71, 10, 23, 31] and do not explicitly decompose events into a series of interactions between objects. On the other hand, image-based structured representations like scene graphs have cascaded improvements across multiple image tasks, including image captioning, image retrieval [35, 63], visual question answering, relationship modeling, and image generation. The scene graph representation, introduced in Visual Genome, provides a scaffold that allows vision models to tackle complex inference tasks by breaking scenes into their corresponding objects and visual relationships. However, decompositions for temporal events have not been explored much, even though representing events with structured representations could lead to more accurate and grounded action understanding.
|Dataset||Video hours||# videos||# action categories||Objects annotated||Objects localized||# object categories||# object instances||Relationships annotated||Relationships localized||# relationship categories||# relationship instances|
|HACS Clips ||833||0.4K||200||-||-||-||-|
Meanwhile, in Cognitive Science and Neuroscience, it has been postulated that people segment events into consistent groups [54, 5, 6]. Furthermore, people actively encode those ongoing activities in a hierarchical part structure, a phenomenon referred to as the hierarchical bias hypothesis or event segmentation theory. Consider the action of “sitting on a sofa”. The person initially starts off next to the sofa, moves in front of it, and finally sits atop it. Such decompositions can enable machines to predict future and past scene graphs with objects and relationships as an action occurs: we can predict that the person is about to sit on the sofa when we see them move in front of it. Similarly, such decompositions can enable machines to learn from few examples: we can recognize the same action when we see a different person move towards a different chair. While that was a relatively simple decomposition, other events like “playing football”, with its multiple rules and actors, can involve multifaceted decompositions. So while such decompositions can provide the scaffolds to improve vision models, how is it possible to correctly create representative hierarchies for a wide variety of complex actions?
In this paper, we introduce Action Genome, a representation that decomposes actions into spatio-temporal scene graphs. Object detection faced a similar challenge of large variation within any object category. So, just as progress in 2D perception was catalyzed by taxonomies, partonomies, and ontologies [78, 42], we aim to improve temporal understanding with Action Genome’s partonomy. Going back to the example of “person sitting on a sofa”, Action Genome breaks down such actions by annotating frames within that action with scene graphs. The graphs capture both the objects, person and sofa, and how their relationships evolve as the action progresses from person - next to - sofa to person - in front of - sofa to finally person - sitting on - sofa. Built upon Charades, Action Genome provides K object bounding boxes with M relationships across K video frames with action categories.
Action-object couplets refer to transitive actions performed on objects (e.g. “moving a chair” or “throwing a ball”) and intransitive self-actions (e.g. “moving towards the sofa”). Action Genome’s dynamic scene graph representations capture both types of events and, as such, represent the prototypical unit. With this representation, we enable the study of tasks such as spatio-temporal scene graph prediction, a task where we estimate the decomposition of action dynamics given a video. We can even improve existing tasks like action recognition and few-shot action detection by jointly studying how those actions change visual relationships between objects in scene graphs.
To demonstrate the utility of Action Genome’s event decomposition, we introduce a method that extends a state-of-the-art action recognition model by incorporating spatio-temporal scene graphs as feature banks that can be used to predict both the action and the objects and relationships involved. First, we demonstrate that predicting scene graphs can benefit the popular task of action recognition by improving the state-of-the-art on the Charades dataset from to and to when using oracle scene graphs. Second, we show that the compositional understanding of actions induces better generalization by showcasing few-shot action recognition experiments, achieving mAP using as few as training examples. Third, we introduce the task of spatio-temporal scene graph prediction and benchmark existing scene graph models with new evaluation metrics designed specifically for videos. With a better understanding of the dynamics of human-object interactions via spatio-temporal scene graphs, we aim to inspire a new line of research in more decomposable and generalizable action understanding.
2 Related work
We derive inspiration from Cognitive Science, compare our representation with static scene graphs, and survey methods in action recognition and few-shot prediction.
Cognitive science. Early work in Cognitive Science provides evidence for the regularities with which people identify event boundaries [54, 5, 6]. Remarkably, people consistently, both within and between subjects, carve out video streams into events, actions, and activities [82, 11, 28]. Such findings hint that it is possible to predict when actions begin and end, and have inspired hundreds of Computer Vision datasets, models, and algorithms to study tasks like action recognition [70, 81, 19, 80, 79, 36]. Subsequent Cognitive and Neuroscience research, using the same paradigm, has also shown that event categories form partonomies [59, 82, 28]. However, Computer Vision has done little work in explicitly representing the hierarchical structures of actions , even though understanding event partonomies can improve tasks like action recognition.
Action recognition in videos. Many research projects have tackled the task of action recognition. A major line of work has focused on developing powerful backbone models to extract useful representations from videos [68, 71, 10, 23, 31]. Pre-trained on large-scale databases for action classification [9, 8], these backbone models serve as cornerstones for downstream video tasks and action recognition on other datasets. To assist more complicated action understanding, another growing set of research explores structural information in videos including temporal ordering [87, 50], object localization [73, 25, 75, 4, 52], and even implicit interactions between objects [52, 4]. In our work, we contrast against these methods by explicitly using a structured decomposition of actions into objects and relationships.
Table 1 lists some of the most popular datasets used for action recognition. One major trend of video datasets is providing a considerably large number of video clips with single action labels [8, 86, 9]. Although these databases have driven the progress of video feature representation for many downstream tasks, the provided annotations treat actions as monolithic events, and do not study how objects and their relationships change during actions/activities. In the meantime, other databases have provided a wider variety of annotations: AVA localizes the actors of actions, Charades contains multiple actions happening at the same time, EPIC-Kitchens localizes the interacted objects in ego-centric kitchen videos, and DALY provides object bounding boxes and upper body poses for daily activities. Still, the scene graph, as a comprehensive structural abstraction of images, has not yet been studied in any large-scale video database as a potential representation for action recognition. In this work, we present Action Genome, the first large-scale database to jointly boost research in scene graphs and action understanding. Compared to existing datasets, we provide orders of magnitude more object and relationship labels grounded in actions.
Scene graph prediction. Scene graphs are a formal representation for image information [35, 42] in the form of a graph, a structure that is widely used in knowledge bases [27, 13, 88]. Each scene graph encodes objects as nodes connected together by pairwise relationships as edges. Scene graphs have led to many state-of-the-art models in image captioning, image retrieval [35, 63], visual question answering, relationship modeling, and image generation. Given its versatile utility, the task of scene graph prediction has resulted in a series of publications [42, 14, 48, 45, 47, 58, 76, 84, 77, 30] that have explored reinforcement learning, structured prediction [39, 16, 69], utilizing object attributes [20, 60], sequential prediction, few-shot prediction [12, 17], and graph-based [76, 46, 77] approaches. However, all of these approaches have restricted their application to static images and have not modelled visual concepts spatio-temporally.
Few-shot prediction. The few-shot literature is broadly divided into two main frameworks. The first strategy learns a classifier for a set of frequent categories and then uses it to learn the few-shot categories [21, 22, 57]. For example, ZSL uses attributes of actions to enable few-shot recognition. The second strategy learns invariances or decompositions that enable few-shot classification [38, 89, 7, 18]. OSS and TARN propose a similarity or distance measure between video pairs [38, 7], CMN uses a multi-saliency algorithm to encode videos, and ProtoGAN creates a prototype vector for each class. Our framework resembles the first strategy because we use the object and visual relationship representations learned using the frequent actions to identify few-shot actions.
3 Action Genome
|Attention||Spatial||Contact||Contact|
|looking at||in front of||carrying||covered by|
|not looking at||behind||drinking from||eating|
|unsure||on the side of||have it on the back||holding|
| ||above||leaning on||lying on|
| ||beneath||not contacting||sitting on|
Inspired by Cognitive Science, we decompose events into prototypical action-object units [83, 62, 43]. Each action in Action Genome is represented as changes to objects and their pairwise interactions with the actor/person performing the action. We derive our representation as a temporally changing version of Visual Genome’s scene graphs. However, unlike Visual Genome, whose goal was to densely represent a scene with objects and visual relationships, Action Genome’s goal is to decompose actions and, as such, it focuses on annotating only those segments of the video where the action occurs and only those objects that are involved in the action.
Annotation framework. Action Genome is built upon the Charades dataset , which contains action classes, of which are human-object activities. In Charades, there are multiple actions that might be occurring at the same time. We do not annotate every single frame in a video; it would be redundant as the changes between objects and relationships occur at longer time scales.
Figure 2 visualizes the pipeline of our annotation. We uniformly sample frames to annotate across the range of each action interval. With this action-oriented sampling strategy, we provide more labels where more actions occur. For instance, in the example, actions “sitting on a chair” and “drinking from a cup” occur together and therefore, result in more annotated frames, from each action. When annotating each sampled frame, the hired annotators were prompted with action labels and clips of the neighboring video frames for context. The annotators first draw bounding boxes around the objects involved in these actions, then choose the relationship labels from the label set. The clips are used to disambiguate between the objects that are actually involved in an action when multiple instances of a given category are present. For example, if multiple “cups” are present, the context disambiguates which “cup” to annotate for the action “drinking from a cup”.
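The action-oriented sampling described above can be illustrated with a minimal sketch. The interval format, the function name `sample_frames`, the choice of bin centres, and the values of `frames_per_action` and `fps` are all assumptions made for illustration; they are not taken from the annotation pipeline itself.

```python
from typing import List, Tuple


def sample_frames(intervals: List[Tuple[float, float]],
                  frames_per_action: int,
                  fps: float) -> List[int]:
    """Uniformly sample frame indices inside each action interval.

    intervals: (start_sec, end_sec) per annotated action (hypothetical format).
    frames_per_action: samples drawn per interval (illustrative value only).
    Returns the sorted union of sampled frame indices.
    """
    sampled = set()
    for start, end in intervals:
        for j in range(frames_per_action):
            # Place each sample at the centre of one of the equal-width bins.
            t = start + (j + 0.5) * (end - start) / frames_per_action
            sampled.add(int(t * fps))
    return sorted(sampled)


# Two overlapping actions: the shared time span receives samples from both.
frames = sample_frames([(0.0, 10.0), (5.0, 15.0)], frames_per_action=5, fps=24.0)
print(frames)
```

Because every action interval contributes its own samples, regions of the video where several actions co-occur end up with denser annotations, which is the behaviour the paragraph describes.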
Action Genome contains three different kinds of human-object relationships: attention, spatial and contact relationships (see Table 2). Attention relationships indicate if a person is looking at an object or not, and serve as indicators for which object the person is or will be interacting with. Spatial relationships describe where objects are relative to one another. Contact relationships describe the different ways the person is contacting an object. A change in contact often indicates the occurrence of an action: for example, changing from person - not contacting - book to person - holding - book may show an action of “picking up a book”.
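To make the idea of a contact change concrete, the following sketch stores one contact relationship per object per frame and reports the frames at which it changes. The dictionary layout and the helper `contact_changes` are hypothetical and serve only as an illustration; they do not describe the released annotation format.

```python
from typing import Dict, List, Tuple

# One annotated frame: object name -> contact relationship with the person.
FrameContacts = Dict[str, str]


def contact_changes(frames: List[FrameContacts]) -> List[Tuple[int, str, str, str]]:
    """Return (frame_index, object, old_contact, new_contact) for every change."""
    changes = []
    for t in range(1, len(frames)):
        prev, curr = frames[t - 1], frames[t]
        # Only compare objects that are annotated in both consecutive frames.
        for obj in prev.keys() & curr.keys():
            if prev[obj] != curr[obj]:
                changes.append((t, obj, prev[obj], curr[obj]))
    return sorted(changes)


video = [
    {"book": "not contacting"},
    {"book": "not contacting"},
    {"book": "holding"},
]
changes = contact_changes(video)
print(changes)
```

In this toy sequence the single reported change, from not contacting to holding, is the kind of transition that may correspond to “picking up a book”.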
It is worth noting that while Charades provides an injective mapping from each action to a verb, it is different from the relationship labels we provide. Charades’ verbs are clip-level labels, such as “awaken”, while we decompose them into frame-level human-object relationships, such as a sequence of person - lying on - bed, person - sitting on - bed and person - not contacting - bed.
Database statistics. Action Genome provides frame-level scene graph labels for the components of each action. Overall, we provide annotations for frames with a total of bounding boxes of object classes (excluding “person”), and instances of relationship classes. Figure 3 visualizes the log-distribution of object and relationship categories in the dataset. Like most concepts in vision, some objects (e.g. table and chair) and relationships (e.g. in front of and not looking at) occur frequently while others (e.g. twisting and doorknob) only occur a handful of times. However, even with such a distribution, almost all objects have at least k instances and every relationship has at least K instances.
Additionally, Figure 4 visualizes how frequently objects occur in which relationships. We see that most objects are fairly evenly involved in all three types of relationships. Unlike Visual Genome, where dataset bias provides a strong baseline for predicting relationships given the object categories, Action Genome does not suffer from the same bias.
4 Method
We validate the utility of Action Genome’s action decomposition by studying the effect of simultaneously learning spatio-temporal scene graphs while learning to recognize actions. We propose a method, named Scene Graph Feature Banks (SGFB), to incorporate spatio-temporal scene graphs into action recognition. Our method is inspired by recent work in computer vision that uses information “banks” [44, 1, 75]. Information banks are feature representations that have been used to represent, for example, object categories that occur in the video, or even include where the objects are. Our model is most directly related to the recent long-term feature banks, which accumulate features of a long video as a fixed-size representation for action recognition.
Overall, our SGFB model contains two components: the first component generates spatio-temporal scene graphs while the second component encodes the graphs to predict action labels. Given a video sequence, the aim of traditional multi-class action recognition is to assign multiple action labels to this video. Here, the video sequence is made up of image frames. SGFB generates a spatio-temporal scene graph for every frame in the given video sequence. The scene graphs are encoded to formulate a spatio-temporal scene graph feature bank for the final task of action recognition. We describe the scene graph prediction and the scene graph feature bank components in more detail below. See Figure 5 for a high-level visualization of the model’s forward pass.
4.1 Scene graph prediction
Previous research has proposed plenty of methods for predicting scene graphs on static images [84, 51, 77, 47, 76, 85]. We employ a state-of-the-art scene graph prediction method as the first step of our method. Given a video sequence, the scene graph predictor generates all the objects in each frame and connects each object to the actor through its relationships. On each frame, the scene graph consists of a set of objects that a person is interacting with and a set of relationships between the person and those objects. Note that there can be multiple relationships between the person and each object, including attention, spatial, and contact relationships. Besides the graph labels, the scene graph predictor also outputs confidence scores for all predicted objects and relationships. We have experimented with various choices of scene graph predictor and benchmark their performance on Action Genome in Section 5.3.
4.2 Scene graph feature banks
After obtaining the scene graph on each frame, we formulate a feature vector by aggregating the information across all the scene graphs into a feature bank. Consider the object classes and the relationship classes in Action Genome. We first construct a confidence matrix with one row per object class and one column per relationship class, so that each entry corresponds to an object-relationship category pair. We compute every entry of this matrix using the scores output by the scene graph predictor. Intuitively, an entry is high when the predictor is confident that the corresponding object is present in the current frame and that its relationship with the actor is the corresponding relationship. We flatten the confidence matrix into the feature vector for each image.
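As a rough illustration of the per-frame feature, the sketch below fills an object-by-relationship confidence matrix and flattens it. The scoring rule used here (the product of the object confidence and the relationship confidence, keeping the maximum over repeated detections), the input format, and the name `confidence_feature` are assumptions made for this example; the exact rule should be taken from the original formulation.

```python
from typing import List, Tuple

# (object_class, object_score, [(relationship_class, relationship_score), ...])
Detection = Tuple[int, float, List[Tuple[int, float]]]


def confidence_feature(detections: List[Detection],
                       num_objects: int,
                       num_relationships: int) -> List[float]:
    """Build a flattened object-by-relationship confidence matrix for one frame."""
    matrix = [[0.0] * num_relationships for _ in range(num_objects)]
    for obj_cls, obj_score, rels in detections:
        for rel_cls, rel_score in rels:
            # Assumed scoring rule: product of the two confidences.
            score = obj_score * rel_score
            # Keep the strongest evidence if a pair is detected more than once.
            matrix[obj_cls][rel_cls] = max(matrix[obj_cls][rel_cls], score)
    # Row-major flattening: index = object_class * num_relationships + relationship_class.
    return [value for row in matrix for value in row]


feature = confidence_feature(
    [(1, 0.9, [(0, 0.5), (2, 1.0)])], num_objects=3, num_relationships=4
)
print(len(feature), feature[4], feature[6])
```

Under this assumption the feature length is the number of object classes times the number of relationship classes, independent of how many objects are detected in the frame.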
This yields a sequence of scene graph features extracted from the video frames. We aggregate the features across the frames using methods similar to long-term feature banks, i.e. the scene graph features are combined with 3D CNN features extracted from a short-term clip using feature bank operators (FBO), which can be instantiated as mean/max pooling or non-local blocks. The 3D CNN embeds short-term information into the clip features while the scene graph feature bank provides contextual information, which is critical in modeling the dynamics of complex actions with long time spans. The final aggregated feature is then used to predict action labels for the video.
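The pooling variants of the feature bank operator can be sketched as follows. This is only a simplified illustration: fusing by concatenation, the function names `pool_bank` and `aggregate`, and the plain-list feature format are assumptions, and the non-local block variant is not covered.

```python
from typing import List


def pool_bank(bank: List[List[float]], mode: str = "mean") -> List[float]:
    """Collapse a sequence of per-frame feature vectors into a single vector."""
    if not bank:
        raise ValueError("feature bank is empty")
    columns = list(zip(*bank))  # one tuple per feature dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


def aggregate(clip_feature: List[float],
              bank: List[List[float]],
              mode: str = "mean") -> List[float]:
    """Concatenate the short-term clip feature with the pooled bank (assumed fusion)."""
    return list(clip_feature) + pool_bank(bank, mode)


bank = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
fused_max = aggregate([2.0], bank, mode="max")
fused_mean = aggregate([2.0], bank, mode="mean")
print(fused_max, fused_mean)
```

Pooling keeps the fused representation at a fixed size regardless of video length, which is what lets a classifier consume long-range context alongside the clip-level feature.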
5 Experiments
Action Genome’s representation enables us to study few-shot action recognition by decomposing actions into temporally changing visual relationships between objects. It also allows us to benchmark whether understanding the decomposition helps improve performance in action recognition or scene graph prediction individually. To study these benefits afforded by Action Genome, we design three experiments: action recognition, few-shot action recognition, and finally, spatio-temporal scene graph prediction.
5.1 Action recognition on Charades
|Method||Backbone||Pre-train||mAP|
|I3D + NL [10, 72]||R101-I3D-NL||Kinetics-400||37.5|
|SlowFast+NL [23, 72]||R101-NL||Kinetics-400||42.5|
|SGFB Oracle (ours)||R101-I3D-NL||Kinetics-400||60.3|
We expect that grounding the components that compose an action — the objects and their relationships — will improve our ability to predict which actions are occurring in a video sequence. So, we evaluate the utility of Action Genome’s scene graphs on the task of action recognition.
Problem formulation. We specifically study multi-class action recognition on the Charades dataset. The Charades dataset contains crowdsourced videos with a length of seconds on average. At any frame, a person can perform multiple actions out of a nomenclature of classes. The multi-classification task provides a video sequence as input and expects multiple action labels as output. We train our SGFB model to predict Charades action labels at test time; during training, we additionally provide SGFB with spatio-temporal scene graphs as supervision.
Baselines. Previous work has proposed methods for multi-class action recognition and benchmarked on Charades. Recent state-of-the-art methods include applying I3D  and non-local blocks  as video feature extractors (I3D+NL), spatio-temporal region graphs (STRG) , Timeception convolutional layers (Timeception) , SlowFast networks (SlowFast) , and long-term feature banks (LFB) . All the baseline methods are pre-trained on Kinetics-400  and the input modality is RGB.
Implementation details. SGFB first predicts a scene graph on each frame, then constructs a spatio-temporal scene graph feature bank for action recognition. We use Faster R-CNN  with ResNet-101  as the backbone for region proposals and object detection. We leverage RelDN  to predict the visual relationships. Scene graph prediction is trained on Action Genome, where we follow the same train/val splits of videos as the Charades dataset. Action recognition uses the same video feature extractor, hyper-parameters, and solver schedulers as long-term feature banks (LFB)  for a fair comparison.
Results. We report the performance of all models using mean average precision (mAP) on the Charades validation set in Table 3. By replacing the feature banks with spatio-temporal scene graph features, we outperform the state-of-the-art LFB by mAP. Our features are smaller in size ( in SGFB versus in LFB) but concisely capture more information for recognizing actions.
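For readers unfamiliar with the metric, a minimal sketch of mean average precision for multi-label video classification is given below. It is a generic textbook-style implementation under simple assumptions (no special handling of tied scores, classes without positives skipped); the official Charades evaluation script may differ in such details, and the function names are illustrative.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision of one class over a ranked list of videos."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at each positive
    return total / hits if hits else 0.0


def mean_average_precision(score_matrix: List[List[float]],
                           label_matrix: List[List[int]]) -> float:
    """Rows are videos, columns are action classes."""
    aps = []
    for c in range(len(score_matrix[0])):
        labels = [row[c] for row in label_matrix]
        if any(labels):  # skip classes with no positive example
            aps.append(average_precision([row[c] for row in score_matrix], labels))
    return sum(aps) / len(aps)


scores = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.6]]
labels = [[1, 0], [0, 1], [1, 0]]
map_value = mean_average_precision(scores, labels)
print(round(map_value, 4))
```

Averaging per class rather than per video means that rare actions count as much as frequent ones, which matters for a long-tailed label set.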
We also find that improving object detectors designed for videos can further improve action recognition results. To quantitatively demonstrate the potential of better scene graphs on action recognition, we designed an SGFB Oracle setup. The SGFB Oracle assumes that a perfect scene graph prediction method is available. The spatio-temporal scene graph feature bank therefore directly encodes a feature vector from ground truth objects and visual relationships for the annotated frames. Feeding such feature banks into the SGFB model, we observe a significant improvement on action recognition: increase on mAP. Such a boost in performance shows the potential of Action Genome and compositional action understanding when video-based scene graph models are utilized to improve scene graph prediction. It is important to note that the performance of SGFB Oracle is not an upper bound on performance since we only utilize ground truth scene graphs for the few frames that have ground truth annotations.
5.2 Few-shot action recognition
Intuitively, predicting actions should be easier from a symbolic embedding of scene graphs than from pixels. When trained with very few examples, compositional action understanding with additional knowledge of scene graphs should outperform methods that treat actions as monolithic concepts. We showcase the capability and potential of spatio-temporal scene graphs to generalize to rare actions.
|SGFB oracle (ours)||30.4||40.2||50.5|
Problem formulation. In our few-shot action recognition experiments on Charades, we split the action classes into a base set of classes and a novel set of classes. We first train a backbone feature extractor (R101-I3D-NL) on all video examples of the base classes, which is shared by the baseline LFB, our SGFB, and SGFB oracle. Next, we train each model with only examples from each novel class, where , for 50 epochs. Finally, we evaluate the trained models on all examples of novel classes in the Charades validation set.
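The construction of a k-shot training set for the novel classes can be sketched as below. The mapping from class name to video identifiers, the function name `few_shot_subset`, and the seeded random selection are illustrative assumptions; the text does not specify how the examples were chosen.

```python
import random
from typing import Dict, List


def few_shot_subset(videos_by_class: Dict[str, List[str]],
                    k: int,
                    seed: int = 0) -> Dict[str, List[str]]:
    """Pick at most k training videos for each novel class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    subset = {}
    for cls in sorted(videos_by_class):
        videos = sorted(videos_by_class[cls])
        subset[cls] = rng.sample(videos, min(k, len(videos)))
    return subset


novel = {
    "novel_action_a": ["v1", "v2", "v3", "v4"],
    "novel_action_b": ["v5", "v6"],
}
subset = few_shot_subset(novel, k=3)
print({cls: len(vids) for cls, vids in subset.items()})
```

Fixing the seed keeps the same k examples across the compared models, so differences in accuracy are not caused by differences in the sampled training videos.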
Results. We report few-shot experiment performance in Table 4. SGFB achieves better performance than LFB on all -shot experiments. Furthermore, when given ground truth scene graphs, SGFB Oracle shows a -shot mAP improvement. We visualize the comparison between SGFB and LFB in Figure 6. With the knowledge of spatio-temporal scene graphs, SGFB better captures action concepts involving the dynamics of objects and relationships.
5.3 Spatio-temporal scene graph prediction
Progress in image-based scene graph prediction has cascaded to improvements across multiple Computer Vision tasks, including image captioning, image retrieval [35, 63], visual question answering, relationship modeling, and image generation. In order to promote similar progress in video-based tasks, we introduce the complementary task of spatio-temporal scene graph prediction. Unlike image-based scene graph prediction, which only has a single image as input, this task expects a video as input and can therefore utilize temporal information from neighboring frames to strengthen its predictions. In this section, we define the task and its evaluation metrics, and report benchmarked results from numerous recently proposed image-based scene graph models applied to this new task.
|Freq Prior ||45.50||45.67||44.91||45.05||43.89||45.60||43.30||44.99||33.37||34.54||32.65||33.79|
|Graph R-CNN ||23.71||23.91||23.42||23.60||21.76||23.50||21.41||23.18||15.99||16.81||15.59||16.42|
Problem formulation. The task expects as input a video sequence made up of image frames. The task requires the model to generate a spatio-temporal scene graph per frame. The nodes of each graph are objects with category labels and bounding box locations, and its edges represent the relationships between pairs of objects.
Evaluation metrics. We borrow the three standard evaluation modes for image-based scene graph prediction : (i) scene graph detection (SGDET) which expects input images and predicts bounding box locations, object categories, and predicate labels, (ii) scene graph classification (SGCLS) which expects ground truth boxes and predicts object categories and predicate labels, and (iii) predicate classification (PREDCLS), which expects ground truth bounding boxes and object categories to predict predicate labels. We refer the reader to the paper that introduced these tasks for more details . We adapt these metrics for video, where the per-frame measurements are first averaged in each video as the measurement of the video, then we average video results as the final result for the test set.
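The two-level averaging used to adapt the image metrics to video can be written as a short sketch. The per-frame scores stand for any of the frame-level measurements; the dictionary layout and the function name `video_averaged_metric` are assumptions made for illustration.

```python
from typing import Dict, List


def video_averaged_metric(per_frame: Dict[str, List[float]]) -> float:
    """Average per-frame scores within each video, then average across videos."""
    # Videos without any measured frame are skipped.
    video_scores = [sum(s) / len(s) for s in per_frame.values() if s]
    if not video_scores:
        raise ValueError("no frame-level measurements provided")
    return sum(video_scores) / len(video_scores)


scores = {
    "video_a": [1.0, 0.0],            # video-level score 0.5
    "video_b": [1.0, 1.0, 1.0, 1.0],  # video-level score 1.0
}
result = video_averaged_metric(scores)
print(result)
```

Averaging within each video first gives every video equal weight, so videos with many annotated frames do not dominate the result; pooling all six frames in this toy example would instead give 5/6.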
Baselines. We benchmark the following recently proposed image-based scene graph models for the task of spatio-temporal scene graph prediction: VRD’s visual module (VRD) , iterative message passing (IMP) , multi-level scene description network (MSDN) , graph convolution R-CNN (Graph R-CNN) , neural motif’s frequency prior (Freq-prior) , and relationship detection network (RelDN) .
Results. To our surprise, we find that IMP, which was one of the earliest scene graph prediction models, actually outperforms numerous more recently proposed methods. The most recently proposed scene graph model, RelDN, marginally outperforms IMP, suggesting that modeling similarities between object and relationship classes improves performance in our task as well. The small gap in performance between the tasks of PredCls and SGCls suggests that these models suffer from not being able to accurately detect the objects in the video frames. Improving object detectors designed specifically for videos could improve performance. The models were trained only using Action Genome’s data and not finetuned on Visual Genome, which contains image-based scene graphs, or on ActivityNet Captions, which contains dense captioning of actions in videos with natural language paragraphs. We expect that finetuning models with such datasets would result in further improvements.
6 Future work
With the rich hierarchy of events, Action Genome not only enables research on spatio-temporal scene graph prediction and compositional action recognition, but also promises various research directions. We hope future work will develop methods for the following:
Spatio-temporal action localization. The majority of spatio-temporal action localization methods [24, 25, 67, 32, 67] focus on localizing the person performing the action but ignore the objects that the person interacts with, which are also involved in the action. Action Genome can enable research on localization of both actors and objects, formulating a more comprehensive grounded action localization task. Furthermore, other variants of this task can also be explored; for example, a weakly-supervised localization task where a model is trained with only action labels but tasked with localizing the actors and objects.
Explainable action models. Explainable visual modeling is an emerging field of research. Amongst numerous techniques, saliency prediction has emerged as a key mechanism to interpret machine learning models [66, 53, 64]. Action Genome provides frame-level labels of attention in the form of objects that the person performing the action is either looking at or interacting with. These labels can be used to further train explainable models.
7 Conclusion
We introduce Action Genome, a representation that decomposes actions into spatio-temporal scene graphs. Scene graphs explain how objects and their relationships change as an action occurs. We demonstrated the utility of Action Genome by collecting a large dataset of spatio-temporal scene graphs and using it to improve state-of-the-art results for action recognition as well as few-shot action recognition. Finally, we benchmarked results for the new task of spatio-temporal scene graph prediction. We hope that Action Genome will inspire a new line of research in more decomposable and generalizable video understanding.
Acknowledgement. This work has been supported by Panasonic. This article solely reflects the opinions and conclusions of its authors and not Panasonic or any entity associated with Panasonic.
-  (2012) Detection bank: an object detection based video representation for multimedia event recognition. In Proceedings of the 20th ACM international conference on Multimedia, pp. 1065–1068. Cited by: §4.
-  (2016) Spice: semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398. Cited by: §1, §2, §5.3.
-  (2019) Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4561–4569. Cited by: §6.
-  (2018) Object level visual reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 105–121. Cited by: §2.
-  (1951) One boy’s day; a specimen record of behavior.. Cited by: §1, §2.
-  (1955) Midwest and its children: the psychological ecology of an american town.. Cited by: §1, §2.
-  (2019) TARN: temporal attentive relation network for few-shot and zero-shot action recognition. arXiv preprint arXiv:1907.09021. Cited by: §2.
Activitynet: a large-scale video benchmark for human activity understanding.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970. Cited by: Table 1, §1, §2, §2.
-  (2019) A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987. Cited by: Table 1, §2, §2.
-  (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §1, §2, §5.1, Table 3.
-  (1996) Events, volume 15 of the international research library of philosophy. Dartmouth Publishing, Aldershot. Cited by: §2.
-  (2019) Scene graph prediction with limited labels. arXiv preprint arXiv:1904.11622. Cited by: §2.
-  (2004) Dependency tree kernels for relation extraction. In Proceedings of the 42nd annual meeting on association for computational linguistics, pp. 423. Cited by: §2.
-  (2017) Detecting visual relationships with deep relational networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3298–3308. Cited by: §2.
-  (2018) Scaling egocentric vision: the epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 720–736. Cited by: Table 1, §2.
-  (2011) Discriminative models for multi-class object layout. International journal of computer vision 95 (1), pp. 1–12. Cited by: §2.
-  (2019) Visual relationships as functions: enabling few-shot scene graph prediction. arXiv preprint arXiv:1906.04876. Cited by: §2.
-  (2019) ProtoGAN: towards few shot learning for action recognition. arXiv preprint arXiv:1909.07945. Cited by: §2.
-  (2016) Daps: deep action proposals for action understanding. In European Conference on Computer Vision, pp. 768–784. Cited by: §2.
-  (2009) Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 1778–1785. Cited by: §2.
-  (2003) A bayesian approach to unsupervised one-shot learning of object categories. In Proceedings Ninth IEEE International Conference on Computer Vision, pp. 1134–1141. Cited by: §2.
-  (2006) One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28 (4), pp. 594–611. Cited by: §2.
-  (2019) Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6202–6211. Cited by: §1, §2, §5.1, Table 3.
-  (2018) A better baseline for ava. arXiv preprint arXiv:1807.10066. Cited by: §6.
-  (2019) Video action transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 244–253. Cited by: §2, §6.
-  (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056. Cited by: Table 1, §2.
-  (2005) Exploring various knowledge in relation extraction. In Proceedings of the 43rd annual meeting on association for computational linguistics, pp. 427–434. Cited by: §2.
-  (2006) Making sense of abstract events: building event schemas. Memory & cognition 34 (6), pp. 1221–1235. Cited by: §2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.1.
-  (2018) Mapping images to scene graphs with permutation-invariant structured prediction. In Advances in Neural Information Processing Systems, pp. 7211–7221. Cited by: §2.
-  (2019) Timeception for complex action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 254–263. Cited by: §1, §2, §5.1, Table 3.
-  (2018) Human centric spatio-temporal action localization. In ActivityNet Workshop on CVPR. Cited by: §6.
-  (2018) Image generation from scene graphs. arXiv preprint arXiv:1804.01622. Cited by: §1, §2, §5.3, §6.
-  (2017) Inferring and executing programs for visual reasoning. arXiv preprint arXiv:1705.03633. Cited by: §1, §2, §5.3.
-  (2015) Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3668–3678. Cited by: §1, §2, §5.3.
-  (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §2.
-  (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §1, §5.1.
-  (2011) One shot similarity metric learning for action recognition. In International Workshop on Similarity-Based Pattern Recognition, pp. 31–45. Cited by: §2.
-  (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pp. 109–117. Cited by: §2.
-  (2018) Referring relationships. In Computer Vision and Pattern Recognition. Cited by: §1, §2, §5.3.
-  (2017) Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp. 706–715. Cited by: §5.3.
-  (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §1, §1, §2, §3, §5.3.
-  (2008) Segmentation in the perception and memory of events. Trends in cognitive sciences 12 (2), pp. 72–79. Cited by: Figure 1, §1, §1, §3.
-  (2010) Object bank: a high-level image representation for scene classification & semantic feature sparsification. In Advances in neural information processing systems, pp. 1378–1386. Cited by: §4.
-  (2017) Vip-cnn: visual phrase guided convolutional neural network. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 7244–7253. Cited by: §2.
-  (2018) Factorizable net: an efficient subgraph-based framework for scene graph generation. In European Conference on Computer Vision, pp. 346–363. Cited by: §2.
-  (2017) Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1261–1270. Cited by: §2, §4.1, §5.3, Table 5.
-  (2017) Deep variation-structured reinforcement learning for visual relationship and attribute detection. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 4408–4417. Cited by: §2.
-  (2014) Discriminative hierarchical modeling of spatio-temporally composable human activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 812–819. Cited by: §1, §1, §2.
-  (2019) Tsm: temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7083–7093. Cited by: §2.
-  (2016) Visual relationship detection with language priors. In European Conference on Computer Vision, pp. 852–869. Cited by: §4.1, §5.3, §5.3, Table 5.
-  (2018) Attend and interact: higher-order object interactions for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6790–6800. Cited by: §2.
-  (2016) Salient deconvolutional networks. In European Conference on Computer Vision, pp. 120–135. Cited by: §6.
-  (1963) The perception of causality. Routledge. Cited by: §1, §2.
-  (1976) Language and perception. Belknap Press. Cited by: §1.
-  (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §1.
-  (2018) A generative approach to zero-shot and few-shot action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 372–380. Cited by: §2.
-  (2017) Pixels to graphs by associative embedding. In Advances in Neural Information Processing Systems, pp. 2168–2177. Cited by: §2.
-  (1973) Attribution and the unit of perception of ongoing behavior. Journal of Personality and Social Psychology 28 (1), pp. 28. Cited by: §2.
-  (2011) Relative attributes. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 503–510. Cited by: §2.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §5.1.
-  (2007) A computational model of event segmentation from perceptual prediction. Cognitive science 31 (4), pp. 613–643. Cited by: §1, §3.
-  (2015) Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, pp. 70–80. Cited by: §1, §2, §5.3.
-  (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §6.
-  (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pp. 510–526. Cited by: Table 1, §1, §1, §1, §2, §3, §5.1.
-  (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §6.
-  (2018) Actor-centric relation network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 318–334. Cited by: §6.
-  (2015) Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497. Cited by: §1, §2.
-  (2010) Auto-context and its application to high-level vision tasks and 3d brain image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (10), pp. 1744–1757. Cited by: §2.
-  (2017) Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1510–1517. Cited by: §2.
-  (2016) Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: §1, §2.
-  (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §4.2, §5.1, Table 3.
-  (2018) Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 399–417. Cited by: §2, §5.1, Table 3.
-  (2016) Human action localization with sparse spatial supervision. arXiv preprint arXiv:1605.05197. Cited by: Table 1, §2.
-  (2019) Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 284–293. Cited by: §1, §2, Figure 5, §4.2, §4, Figure 6, §5.1, §5.1, Table 3, Table 4.
-  (2017) Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2. Cited by: §2, §4.1, §5.3, Table 5.
-  (2018) Graph r-cnn for scene graph generation. arXiv preprint arXiv:1808.00191. Cited by: §2, §4.1, §5.3, Table 5.
-  (2007) Introduction to a large-scale general purpose ground truth database: methodology, annotation tool and benchmarks. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 169–183. Cited by: §1.
-  (2018) Every moment counts: dense detailed labeling of actions in complex videos. International Journal of Computer Vision 126 (2-4), pp. 375–389. Cited by: §2.
-  (2016) End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2678–2687. Cited by: §2.
-  (2015) Beyond short snippets: deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4694–4702. Cited by: §2.
-  (2001) Human brain activity time-locked to perceptual event boundaries. Nature neuroscience 4 (6), pp. 651. Cited by: §2.
-  (2001) Perceiving, remembering, and communicating structure in events. Journal of experimental psychology: General 130 (1), pp. 29. Cited by: Figure 1, §1, §1, §3.
-  (2017) Neural motifs: scene graph parsing with global context. arXiv preprint arXiv:1711.06640. Cited by: §2, §4.1, §5.3, Table 5.
-  (2019) Graphical contrastive losses for scene graph parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11535–11543. Cited by: §4.1, §5.1, §5.3, Table 5.
-  (2019) Hacs: human action clips and segments dataset for recognition and temporal localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8668–8678. Cited by: Table 1, §1, §2.
-  (2018) Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 803–818. Cited by: §2.
-  (2007) Tree kernel-based relation extraction with context-sensitive structured parse tree information. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Cited by: §2.
-  (2018) Compound memory networks for few-shot video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 751–766. Cited by: §2.
-  (2019) Explainable video action reasoning via prior knowledge and state transitions. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 521–529. Cited by: Table 1.