Compositional Video Synthesis with Action Graphs

06/27/2020 ∙ by Amir Bar, et al.

Videos of actions are complex spatio-temporal signals, containing rich compositional structures. Current generative models are limited in their ability to generate examples of object configurations outside the range they were trained on. Towards this end, we introduce a generative model (AG2Vid) based on Action Graphs, a natural and convenient structure that represents the dynamics of actions between objects over time. Our AG2Vid model disentangles appearance and position features, allowing for more accurate generation. AG2Vid is evaluated on the CATER and Something-Something datasets and outperforms other baselines. Finally, we show how Action Graphs can be used for generating novel compositions of unseen actions.







1 Introduction

Generative models for images have improved dramatically in recent years for domains such as faces Karras et al. (2019a, b), visual categories Brock et al. (2019) and even complex scenes Herzig et al. (2019a). Generating videos is a much harder task because videos contain long range spatio-temporal dependencies, many of which are created when an object or a person performs an action with other objects.

Actions are a fundamental building block of videos and a key source of their richness and complexity. Actions are compositional, evolve over time and could involve multiple objects and agents, as in the case where one player passes a ball to another. We propose to focus on the task of generating actions as an important step towards generating videos of complex scenes.

We define the task of action generation as taking a description of an action and producing a video that depicts that action. But how should actions be described? Classic work in cognitive psychology argues that actions (and more generally events) are bounded regions of space-time, composed of atomic action units Quine (1985); Zacks and Tversky (2001). In a video, multiple actions can be applied to one or more objects, changing the relationships between the subject and the object of an action over time. Based on these observations, we introduce a formalism we call an “Action Graph”, a graph where nodes are objects and edges are actions specified by their start and end time. See Figure 1.

We argue that Action Graphs are an intuitive representation for describing actions and a natural way to provide inputs to generative models. They can be viewed as a temporal extension of scene graphs, which have proven effective at describing static scenes. The main advantage of Action Graphs is their ability to describe the actions of multiple objects in a scene.

In our video generation framework, our inputs include an initial frame of the video and an Action Graph. Instead of generating pixels directly, we propose a two-stage pipeline, where bounding-box layouts are first generated over time as intermediate representations, and pixels are then generated conditioned on the layouts. This approach disentangles motion and appearance, leading to better generalization. A key challenge for our model is keeping track of the execution of the different actions. We propose to do this via a notion of “clocked edges” that carry variables specifying action time. We develop a graph neural network Kipf and Welling (2016) which operates on Action Graphs and predicts an updated scene layout, thereby effectively integrating spatio-temporal information.

We apply our Action-Graph-to-Video (AG2Vid) model to two different datasets: CATER Girdhar and Ramanan (2020) and Something-Something Goyal et al. (2017). We show that Action Graphs not only help model complex events in simulation, but also help generate high-quality videos in real-world settings with large motion changes. More importantly, disentangling structure generation from appearance generation allows our model to generalize to unseen events.

Our contributions are thus: 1) Introducing the formalism of Action Graphs (AG) and proposing a new video synthesis task. 2) Presenting a novel action-graph-to-video (AG2Vid) model for this task. 3) Demonstrating an approach for constructing new actions out of existing atomic actions.

Figure 1: The Action Graph to Video task. Given an Action Graph, an initial image and an initial layout, the goal is to generate a video that satisfies the Action Graph constraints.

2 Related work

Conditional generation. Conditional video generation has attracted considerable interest recently, with focus on two main tasks: video prediction Battaglia et al. (2016); Kipf et al. (2018); Mathieu et al. (2015); Walker et al. (2016); Watters et al. (2017); Ye et al. (2019) and video-to-video translation Chan et al. (2019); Siarohin et al. (2019); Kim et al. (2019); Wang et al. (2019, 2018a). In prediction, the goal is to generate future video frames conditioned on a few initial frames. For example, it was proposed to train predictors with GANs Goodfellow et al. (2014) to predict future pixels Mathieu et al. (2015). However, directly predicting pixels is very challenging Walker et al. (2016). Instead of pixels, researchers have explored object-centric graphs and performed prediction on those Battaglia et al. (2016); Ye et al. (2019). While inspired by object-centric representations, our method differs from these works in that our generation is guided by an action graph, which largely reduces the uncertainty in generation and leads to much better generalization. Our work is closer in spirit to work on video-to-video translation. The video-to-video translation task was first proposed in Wang et al. (2018a), where a natural video was generated from frame-wise semantic segmentation annotations. However, labeling dense pixels for each frame is very expensive, and might not even be necessary. Motivated by this, researchers have sought to perform generation conditioned on more accessible signals, including audio or text Fried et al. (2019); Ginosar et al. (2019); Song et al. (2018). For example, Ginosar et al. (2019) synthesizes a video of a speaker given her input speech by utilizing speech-to-gesture dynamics. Here, we propose to synthesize videos conditioned on a novel action graph, which is easy to obtain compared to semantic segmentation and is a more structured representation than audio or text.

Scene graphs (SG). Our action graph representation is inspired by scene graphs Johnson et al. (2015, 2018), a structured representation that models a spatial scene, where objects are nodes and relations are edges. SGs have been widely used in various tasks including image retrieval Johnson et al. (2015); Schuster et al. (2015), relationship modeling Krishna et al. (2018); Raboh et al. (2020); Schroeder et al. (2019), and image captioning Xu et al. (2019). Recently, scene graphs have also been applied to image generation Herzig et al. (2018); Johnson et al. (2018), where the goal is to generate a natural image that corresponds to the high-level scene described by the input SG. With SGs, a two-stage pipeline was proposed to first generate the scene layouts and then the pixels, which inspires our work. However, since these approaches focus on image generation, the relations in an SG are mainly defined based on spatial locations, without any temporal dynamics. In our work, the action graph encodes spatio-temporal dynamics, where each edge represents one temporal action applied to objects. By considering the temporal dynamics, we are able to generate frames that are temporally coherent and more realistic.

Action recognition. Spatio-temporal scene graphs have been explored in the field of action recognition Girdhar et al. (2019); Herzig et al. (2019b); Jain et al. (2016); Materzynska et al. (2020); Sun et al. (2018); Wang and Gupta (2018); Yan et al. (2018). For example, a space-time region graph is proposed in Wang and Gupta (2018), where object regions are taken as nodes and a GCN Kipf and Welling (2016) is applied to perform reasoning among objects for classifying actions. Recently, it was also shown in Girdhar and Ramanan (2020); Ji et al. (2019); Yi et al. (2019) that the key obstacles in action recognition are the ability to capture long-range dependencies and the ability to model the compositionality of actions. Our graph reasoning algorithm is inspired by these approaches. Instead of recognition, however, we focus on generating realistic videos, which is a very different challenge.

Figure 2: Example of a partial Action Graph execution schedule in different time-steps.

3 Action Graphs

Our goal in this work is to build a model for synthesizing videos with actions that can be manipulated in a symbolic manner. A key component in this effort is developing a semantic representation to describe the actions, together with their relations to objects in the scene. We introduce a formalism we call Action Graphs (AG) that captures all these relations. In an action graph, nodes correspond to objects, and edges correspond to directed actions operating on these objects. Objects and actions are annotated by semantic categories and actions are also annotated by the time of action.

More formally, an action graph is a tuple described as follows:


  • An alphabet C of object categories. Categories can be compound and include attributes, for example “Blue Cylinder” or “Large Box”.

  • An alphabet A of action categories, for example “Cover” and “Rotate”.

  • Object nodes O: a set of objects, each annotated with a category from C.

  • Action edges E: actions are represented as labeled directed edges between object nodes. Each edge is annotated with an action category and with the time period during which the action is performed. Formally, each edge is of the form (o1, a, o2, ts, te), where o1, o2 are object instances, a is an action category and ts, te are the action start and end times. This edge implies that object o1 performs the action a over object o2, and that this action takes place between times ts and te. We note that an action graph edge can directly model actions over a single object or a pair of objects. For example, “Swap the positions of objects o1 and o2 between time 0 and 9” is an action over two objects corresponding to the edge (o1, swap, o2, 0, 9). Some actions, such as “Rotate”, involve only one object and will therefore be specified as self-loops.
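As an illustration, an action graph of this form can be represented in a few lines of code; the class and field names below are our own, not from the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ActionEdge:
    """A directed action edge: `subject` performs `action` on `obj` during [t_start, t_end]."""
    subject: str   # node id of the acting object
    action: str    # action category, e.g. "slide" or "rotate"
    obj: str       # node id of the object acted upon (== subject for self-loops)
    t_start: int
    t_end: int

@dataclass
class ActionGraph:
    objects: dict = field(default_factory=dict)  # node id -> category, e.g. "blue cylinder"
    edges: list = field(default_factory=list)    # list of ActionEdge

# A tiny example graph: a rotating cylinder that is later covered by a box.
ag = ActionGraph()
ag.objects = {"o1": "blue cylinder", "o2": "large box"}
ag.edges.append(ActionEdge("o1", "rotate", "o1", 0, 5))  # single-object action: self-loop
ag.edges.append(ActionEdge("o2", "cover", "o1", 3, 9))   # two-object action
```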

4 Action Graph to Video via Clocked Edges

We now turn to the key challenge of this paper: transforming an action graph into a video. Naturally, this transformation will be learned from data. The generation problem is defined as follows: we wish to build a generator that takes as input an action graph and outputs a video.¹ We will also allow the option of conditioning on the first frame of the video, so we can preserve the visual attributes of the given objects.² We learn from training data that consists of pairs of action graphs and videos corresponding to these graphs.

¹ The generator can depend on stochastic noise as in GANs, but following most recent work on conditional generation, we consider deterministic maps.
² Using the first frame can be avoided by using an SG2Image model Ashual and Wolf (2019); Herzig et al. (2019a); Johnson et al. (2018) for generating the first frame.

There are multiple unique challenges in generating a video from an action graph that cannot be addressed using current generation methods. First, each action in the graph unfolds over time, so the model needs to “keep track” of the progress of actions rather than just condition on previous frames as commonly done. Second, action graphs contain multiple concurrent actions and the generation process needs to combine them in a realistic way. Third, one has to design training losses that capture the spatio-temporal video structure to ensure that the semantics of the action graph is captured.

Clocked Edges.

As discussed above, we need a mechanism for monitoring the progress of action execution during the video. A natural approach is to keep a “clock” for each action, keeping track of its progress as the video progresses. See Fig. 2 for an illustration. Formally, we keep a clocked version of the graph where each edge is augmented with a temporal state. Let e = (o1, a, o2, ts, te) be an edge in the action graph. We define the progress of e at time t to be c = (t − ts)/(te − ts), clipped to [0, 1]. Thus, c = 0 if the action has not started yet, 0 < c < 1 if it is currently being executed, and c = 1 if it has completed. We then create an augmented version of the edge at time t given by (o1, a, o2, ts, te, c). We define A_t to be the action graph at time t. To summarize, we take the original graph and turn it into a sequence of action graphs A_1, …, A_T, where T is the last time-step. Each action edge in the graph now has its unique clock for its execution. This facilitates both a timely execution of actions and coordination between actions.
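The clipped progress value can be computed directly from an edge's start and end times; a minimal sketch:

```python
def action_progress(t, t_start, t_end):
    """Progress c of an action at time t, clipped to [0, 1]:
    c == 0 before the action starts, 0 < c < 1 while it executes,
    and c == 1 once it has completed."""
    c = (t - t_start) / (t_end - t_start)
    return min(max(c, 0.0), 1.0)
```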

Figure 3: Our AG2Vid Model. The Action Graph at time t describes the execution stage of each action at that time. Together with the previous layout and frame, it is used to generate the next frame.

4.1 The AG2Vid Model

Next, we describe our proposed action-graph-to-video model (AG2Vid). Figure 3 provides a high-level illustration of our model. We assume that frames are generated sequentially, and let p denote the generating distribution of the frames given the input.

Of key importance to our generation process is the layout l_t of the objects in every frame, namely the set of bounding boxes corresponding to the objects at frame t. These describe the coarse-level motion trajectories of the objects. The rationale of our generation process is that the action graph is used to produce the layouts, and these in turn are used to produce the frame pixels. Formally, we let l_t denote a set of vectors, one per bounding box. Each such vector contains the four bounding box coordinates, as well as a descriptor vector for the bounding box (this can be thought of as capturing visual attributes of the object, such as its category, color, geometric configuration, etc.).

Following Wang et al. (2018a), we make the Markov assumption that generation of both the layout l_t and the frame v_t directly depends only on some of the information generated thus far. Specifically, we assume that l_t depends only on l_{t−1} and A_t, and that v_t depends only on v_{t−1}, l_{t−1} and l_t. This corresponds to the following form for p:

p(v_{1:T}, l_{1:T} | A, v_0, l_0) = ∏_t p(l_t | l_{t−1}, A_t) · p(v_t | v_{t−1}, l_{t−1}, l_t)

We refer to the distribution p(l_t | l_{t−1}, A_t) as Layout Generation and to p(v_t | v_{t−1}, l_{t−1}, l_t) as Frame Generation. Following Wang et al. (2018a) we assume that these are deterministic distributions; for example, for frame generation, v_t is a deterministic function of (v_{t−1}, l_{t−1}, l_t). We next describe these functions.
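The factorization above translates to a simple sequential generation loop. In this minimal sketch, `layout_fn`, `frame_fn` and the tuple-based edge format are illustrative stand-ins for the learned networks, not the paper's implementation:

```python
def clip01(x):
    return min(max(x, 0.0), 1.0)

def clock_graph(edges, t):
    """Attach a progress clock to each edge (subj, action, obj, t_s, t_e)."""
    return [(s, a, o, ts, te, clip01((t - ts) / (te - ts)))
            for (s, a, o, ts, te) in edges]

def generate_video(edges, v0, l0, T, layout_fn, frame_fn):
    """Sequential generation under the Markov assumption: the layout l_t
    depends on (l_{t-1}, A_t) and the frame v_t on (v_{t-1}, l_{t-1}, l_t)."""
    frames, layouts = [v0], [l0]
    for t in range(1, T + 1):
        a_t = clock_graph(edges, t)                   # action graph with per-edge clocks
        l_t = layout_fn(layouts[-1], a_t)             # Layout Generation
        v_t = frame_fn(frames[-1], layouts[-1], l_t)  # Frame Generation
        layouts.append(l_t)
        frames.append(v_t)
    return frames, layouts
```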

The Layout Generating Function: At time t we want to use the previous layout l_{t−1} and the current action graph A_t to predict the current layout l_t. The rationale is that A_t captures the current state of the actions and can thus “propagate” l_{t−1} to the next layout. This prediction requires integrating information from all boxes as well as the edges of A_t. Thus, a natural architecture for this task is a Graph Convolutional Network (GCN) Kipf and Welling (2016) that operates on the graph A_t, whose nodes are “enriched” with the layouts l_{t−1}. Formally, we construct a new graph of the same structure as A_t, with new features on nodes and edges. At the node corresponding to an object, the features are an embedding of its category and its layout from l_{t−1}. The features on edges are derived from the clocked edges of A_t. The GCN first applies a neural network to both node and edge features. Then, node and edge features are repeatedly re-estimated using standard GCN aggregation operations (for more information, see Supplemental Section 1.2). After applying the above re-estimation for K steps, each node feature is used to extract the new layout by applying a neural net to its current features.
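One such aggregation round can be sketched as follows. This is a simplified, assumption-laden version (sum aggregation, tanh nonlinearities, and the weight-matrix names `W_msg`/`W_upd` are all illustrative), not the authors' implementation:

```python
import numpy as np

def gcn_step(node_feats, edges, edge_feats, W_msg, W_upd):
    """One round of message passing on an action graph.
    node_feats: dict node -> feature vector of size d
    edges: list of (src, dst) node pairs; edge_feats: matching list of size-de vectors
    W_msg: (d, d + de) message weights; W_upd: (d, 2 * d) update weights."""
    agg = {n: np.zeros_like(f) for n, f in node_feats.items()}
    for (src, dst), ef in zip(edges, edge_feats):
        # message combines the source node's features with the edge features
        msg = np.tanh(W_msg @ np.concatenate([node_feats[src], ef]))
        agg[dst] = agg[dst] + msg  # sum messages arriving at each node
    # each node updates its own feature from (old feature, aggregated messages)
    return {n: np.tanh(W_upd @ np.concatenate([node_feats[n], agg[n]]))
            for n in node_feats}
```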

The Frame Generating Function: After obtaining the layout l_t, we wish to use it along with v_{t−1} and l_{t−1} to predict the next frame. The idea is that l_{t−1} and l_t characterize how objects should move, while v_{t−1} shows their last physical appearance. Combining these two information sources, we should be able to generate the next frame accurately. As a first step, we estimate the optical flow at time t, denoted by f_t, as a function of v_{t−1}, l_{t−1} and l_t. The idea is that given the previous frame and two consecutive layouts, we should be able to predict in which direction pixels in the image will move, namely predict the flow. The optical flow network is similar to Ilg et al. (2017) and based on residual networks He et al. (2016). This network is trained using an auxiliary loss based on estimated flows (see Section 4.2). Given the flow f_t and the previous frame v_{t−1}, a natural estimate of the next frame is obtained with a warping function Zhou et al. (2016), w_t = warp(v_{t−1}, f_t). Finally, we fine-tune w_t via a network that provides an additive correction, resulting in the final frame prediction; the correction network is the SPADE generator from Park et al. (2019).
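The warping step can be illustrated with a minimal nearest-neighbour backward warp. The real model uses a learned flow network and differentiable warping Zhou et al. (2016), so this is only a sketch under simplified assumptions (grayscale frames, flow in pixel units):

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp a grayscale frame (H, W) by a flow field (H, W, 2),
    nearest-neighbour: output pixel (y, x) is sampled from the input at
    (y - flow_y, x - flow_x), clamped to the image borders."""
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, W - 1)
    return frame[src_y, src_x]
```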

4.2 Losses and Training

Our model contains several intermediate representations: layouts l_t, flows f_t, and pixels v_t. For training, we assume we have ground truth frames and ground truth layouts. We calculate flows from ground truth frames using the iterative Lucas-Kanade algorithm Bouguet (2000), obtaining estimated flows (we do not call these “GT”, since they are not ground truth flows). Our losses below use these training signals.

Layout loss. The loss between the ground truth bounding boxes and the predicted ones (here we ignore the object descriptor part of the layout).
Pixel action discriminator loss. For the generated pixels we employ a GAN-type loss that uses a discriminator between generated frames and ground truth frames. Importantly, the discriminator is conditioned on the action graph and the ground truth layout, since frame generation is conditioned on these.³ Formally, let D be a discriminator function with output in [0, 1]. We adopt a multi-scale PatchGAN discriminator Isola et al. (2017) similar to the one used in pix2pixHD Wang et al. (2018b). The loss is then the standard GAN loss (e.g., see Isola et al. (2017)):

L_GAN = E_GT[log D(v_t, A_t, l_t)] + E_gen[log(1 − D(v_t, A_t, l_t))]

where E_GT corresponds to sampling frames from the ground truth videos, and E_gen corresponds to sampling from the generated videos. In the loss for generation we use the ground-truth layout, since this allows for faster training in practice. Optimization of this loss is done in the standard way: alternating gradient ascent on discriminator parameters and descent on generator parameters.

³ Discriminators for sequences can also be considered, but the simpler version works, and is faster to train.
Flow loss. The flow loss includes two terms: the loss between the flow estimated from GT frames and the predicted flow, and the warping loss, which measures the error between the warped previous frame and the predicted next frame, as in Wang et al. (2018a) (Eq. 8 therein).
Perceptual loss. We add the VGG feature matching loss as in Dosovitskiy and Brox (2016); Johnson et al. (2016); Wang et al. (2018b).

The final optimization problem is to minimize a weighted sum of the above losses. Minimization is with respect to all generator parameters (see Supp. for more information).
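The overall objective can be sketched as a weighted sum of the four losses; the function names and default weight values below are placeholders, not those used in the paper:

```python
import numpy as np

def layout_l1_loss(gt_boxes, pred_boxes):
    """Mean absolute error between ground-truth and predicted box coordinates (N, 4)."""
    return np.abs(gt_boxes - pred_boxes).mean()

def total_loss(l_layout, l_gan, l_flow, l_perc,
               w_layout=1.0, w_gan=1.0, w_flow=1.0, w_perc=1.0):
    """Weighted sum of the layout, GAN, flow and perceptual losses;
    the default weights are illustrative, not the authors' values."""
    return (w_layout * l_layout + w_gan * l_gan
            + w_flow * l_flow + w_perc * l_perc)
```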

5 Experiments

In this section we evaluate our AG2Vid model on the CATER and Something Something V2 datasets. For each dataset we learn an AG2Vid model with a given set of actions. We then perform evaluation both on the visual quality of the resulting videos, and on their semantic agreement with the generated actions. For full details about training, evaluation and ablation tests, see the Supplementary.

Figure 4: Comparison of baseline methods. The top two rows are based on CATER videos, while the bottom two rows are based on Something-Something videos. The “Ours + Flow” model refers to our model without the correction network (only flow prediction). Click the image to play the video clip in a browser.

Implementation details: The GCN model uses hidden layers and an embedding layer for each object and action. For optimization we use the ADAM optimizer Kingma and Ba (2014). Every model was trained on an NVIDIA V100 GPU for approximately 2 weeks. For the loss weights (see Section 4.2) and the sequence lengths and batch sizes used to train the Frame Generating Function and the Layout Generating Function, see the Supplementary.

Datasets: We use the following datasets: (1) CATER Girdhar and Ramanan (2020) is a synthetic video dataset originally created for action recognition and reasoning. The main entities in the data are objects, spatial relations and actions over objects. Every object has color, shape, size and material attributes. Actions include “rotate”, “cover”, “pick place” and “slide”, and every action has an indicated start and end time. We use the standard CATER train partition (3849 videos) and split the validation set into 30% val (495 videos), using the rest for testing (1156 videos). CATER videos are given at 24 FPS; we subsample to 8 FPS in all our CATER experiments. (2) Something-Something Goyal et al. (2017) is an action recognition dataset and benchmark containing videos of basic actions. Here we included videos of the 8 most frequent actions, for example “Putting [something] on a surface”, “Moving [something] up” and “Covering [something] with [something]”.

Performance metrics: We evaluate the predicted layout locations using the mean intersection over union (mIOU) and the quality of the generated frames using the Learned Perceptual Image Patch Similarity (LPIPS) Zhang et al. (2018). Additionally, we evaluate the visual quality of the generated videos with human annotators on Amazon Mechanical Turk (Master annotators). In the “Visual Quality” task, we compared pairs of generation algorithms, and raters were asked to say which of two generated videos is more realistic. The metric for this task is the fraction of times an algorithm was selected. In the “Semantic Accuracy” task, raters were asked to select the action category describing the video, and the average accuracy (compared to the action that was meant to be generated) was computed.
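The mIOU metric over matched boxes can be computed as follows (a standard IoU sketch, assuming boxes in (x1, y1, x2, y2) format and a one-to-one matching between ground-truth and predicted boxes):

```python
def box_iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def mean_iou(gt_boxes, pred_boxes):
    """mIoU over matched ground-truth / predicted box pairs."""
    return sum(box_iou(g, p) for g, p in zip(gt_boxes, pred_boxes)) / len(gt_boxes)
```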

Baselines: We compare AG2Vid to several state-of-the-art generators. Since action-graph-to-video is a new task, there is no existing baseline for generating layouts as our GCN does. Thus, the baselines below either use our predicted layout, or just predict from the first frame without conditioning on actions (in the latter case, we evaluate video quality and not action semantics). 1) Ours + V2V: Vid2Vid Wang et al. (2018a) learns a mapping from input semantic segmentation maps to output videos. In our case, we only have our predicted layout as input, and thus train Vid2Vid with it. This baseline therefore shares our GCN model, but uses a different pixel generation model. 2) CVP Ye et al. (2019): This model uses an initial input image and layout for future frame prediction while reasoning about entity interactions. Since the model only takes the initial frame but not the action graph, we do not expect it to capture action semantics. Instead, we use it to evaluate how well a realistic video can be generated from a single frame (recall that our model also uses the first frame). 3) Sg2Im Johnson et al. (2018): Both CATER and Something-Something contain frame-level scene graph annotations. This baseline uses a scene-graph-to-image model to generate a video from this scene graph sequence. This model does not condition on the actions or the initial frame and serves only for comparison in terms of realistic generation.

Figure 5: Compositional action synthesis in Something-Something and CATER. The objects involved in the composed actions are highlighted. Click the image to play the video clip in a browser.
Methods                          mIOU           LPIPS          Human
                                 CATER  Smth    CATER  Smth    CATER  Smth
Sg2Im Johnson et al. (2018)      18.1   -       0.39   -       50.6   -
CVP Ye et al. (2019)             69.4   50.7    0.31   0.59    42.2   25.6
Ours + V2V Wang et al. (2018a)   88.2   59.4    0.16   0.32    50.0   50.0
Ours                             88.2   59.4    0.09   0.25    74.4   84.0
Table 1: Quantitative comparison of different models on the CATER and Something-Something (Smth) V2 datasets. The human raters were given the “Visual Quality” task described in the text, where all pairwise comparisons were with respect to the V2V baseline.

Composing New Actions: A key advantage of Action Graphs is that we can compose new, unseen complex actions at inference time out of existing atomic actions. To demonstrate this capability, we compose new actions as follows. For CATER, we created two new actions: 1) “swap” is created by constructing the edges (o1, “slide”, o2, ts, te) and (o2, “pick place”, o1, ts, te). This results in o1 sliding towards o2 while o2 jumps towards o1, hence swapping places. 2) “huddle” is created by employing the “contain” action for every object in the scene over the Snitch object. For Something-Something Goyal et al. (2017), we composed “push-left” and “move-down” to form the “left-down” action, and “push-right” and “move-up” to form the “right-up” action. We evaluate the semantic accuracy of the composed actions via the rating mechanism described above.
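Such compositions amount to emitting several concurrent atomic edges over the same time interval; a minimal sketch with hypothetical helper names, using the (subject, action, object, t_start, t_end) edge format:

```python
def compose_swap(o1, o2, t_start, t_end):
    """'Swap' as two concurrent atomic edges: o1 slides to o2's position
    while o2 is picked up and placed at o1's position."""
    return [(o1, "slide", o2, t_start, t_end),
            (o2, "pick place", o1, t_start, t_end)]

def compose_huddle(objects, snitch, t_start, t_end):
    """'Huddle': every other object in the scene applies 'contain' to the snitch."""
    return [(o, "contain", snitch, t_start, t_end)
            for o in objects if o != snitch]
```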

5.1 Results

Generation results for the two datasets can be seen in Figure 4, and examples of generated composed new actions are in Figure 5. For additional qualitative examples see the Supplementary.

Visual Quality: Table 1 compares the four video generation methods in terms of visual quality (and not action semantics). Our AG2Vid approach results in the best quality generation across the different metrics. In terms of layout accuracy (mIOU), it is not surprising that both SG2IM and CVP do not perform as well as our method, as they are not conditioned on the action graph. However, the IOU of CVP is not at chance level, since the first frame is somewhat predictive of object locations in the rest of the video. In terms of the LPIPS metric, our approach significantly outperforms the others, as it does in the human rating, where raters judged it to be more realistic than the other baselines.
Timing Actions: To evaluate the extent to which AGs can control execution timing, we generated similar AGs with different timings and asked annotators to choose in which video the action is executed first. In 89.45% of the cases, the annotators were in agreement with the intended result. For the full description of the experiment, please refer to the Supplementary Material.
Semantic Quality: To evaluate the semantic quality of the generated actions, we constructed AGs of single actions and generated the corresponding videos. For every such video, raters were asked to assign an action category. Table 3 reports the accuracy on this task for the two datasets. For each dataset, we evaluate on up to eight of the most frequent actions in the data. In addition, we separately evaluated two “made up” actions per dataset that resulted from composing actions in the data. For the full results over all actions, please refer to the Supplementary Material.
Ablations: To understand the contribution of the different losses to generation quality, we perform an ablation study, adding one loss at a time. Table 2 reports the results, showing that the perceptual loss significantly improves performance on CATER. The comparison between the perceptual loss and the action discriminator loss is not conclusive, since the LPIPS metric is itself perceptual and both losses minimize it. Thus, we further asked annotators to compare the visual quality of the two variants and found that the action discriminator variant was judged better on both CATER and Something-Something. Finally, to test the contribution of the layout component, we ran a simple ablation that generates random layouts. The resulting mIOU and LPIPS scores on both CATER and Something-Something were significantly worse than the results for the model with our GCN layout prediction.

Methods                        mIOU            LPIPS
                               CATER   Smth    CATER   Smth
Flow Loss                      88.23   59.36   0.255   0.258
+ Perceptual Loss              88.23   59.36   0.097   0.252
+ Action Discriminator Loss    88.23   59.36   0.090   0.254
Table 2: Ablation results of the AG2Vid model for different losses. The losses are added one by one.


Methods Standard Actions Composed Actions
Slide Cont. PP Rotate Right Uncov. Up Take Swap Hudd. RU DL
Generated action 93.5 66.7 88.9 78.2 100. 50.0 100. 75.0 92.1 98.6 75.0 100.
GT action - - - - 100. 100. 90.0 100. - - - -
Table 3: Human evaluation of the semantic quality of the generated videos. For each generated video with a given action, we asked raters to select the action described in the video. Actions above correspond to: Slide, Contain, Pick Place, Rotate, Push right, Uncover, Move up, Take, Swap, Huddle, Right-Up and Down-Left.

6 Discussion

We present a video synthesis approach based on a new Action Graph formalism that describes how multiple objects interact in a scene over time. Using this formalism, we can synthesize complicated compositional videos and construct new, unseen actions. Although our approach outperforms previous methods, our model still fails in several situations. First, it depends on the initial frame and layout. This could potentially be addressed by using an off-the-shelf image generation model. The formal AG representation is designed to describe complex semantic information in an easy-to-grasp way; however, while it can represent the actions present in the datasets used in this work, representing other actions might require extending it. Another possible drawback is the evaluation metrics: automatic generation evaluation is based solely on pixel-level measures and does not test the semantics of the actions in the video. We used human evaluation to address this, although ultimately better automatic metrics are needed. In addition, the quality of natural video synthesis can still be improved, as can be seen in our results on the Something-Something dataset; we believe better pixel synthesis methods could be integrated easily into our approach. Finally, the Action Graph formalism itself has limitations. It cannot describe a rich action language, since, unlike natural language, it captures only the “What” (the categories of actions and the objects they act upon) and not the “How” (the adverbs describing properties of the action). This limitation could be alleviated by adding attributes to the actions themselves, which we leave for future work.


This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell's group was supported in part by DoD, NSF, BAIR, and BDD. We would also like to thank Anna Rohrbach for valuable feedback and comments on drafts, and Lior Bracha for running the MTurk experiments.


  • O. Ashual and L. Wolf (2019) Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4561–4569. Cited by: footnote 2.
  • P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. (2016) Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pp. 4502–4510. Cited by: §2.
  • J. Bouguet (2000) Pyramidal implementation of the lucas kanade feature tracker description of the algorithm. Intel Corporation Microprocessor Research Labs. External Links: Link Cited by: §4.2.
  • A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: §1.
  • C. Chan, S. Ginosar, T. Zhou, and A. A. Efros (2019) Everybody dance now. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5933–5942. Cited by: §2.
  • A. Dosovitskiy and T. Brox (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 658–666. Cited by: §4.2.
  • O. Fried, A. Tewari, M. Zollhöfer, A. Finkelstein, E. Shechtman, D. B. Goldman, K. Genova, Z. Jin, C. Theobalt, and M. Agrawala (2019) Text-based editing of talking-head video. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–14. Cited by: §2.
  • S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik (2019) Learning individual styles of conversational gesture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3497–3506. Cited by: §2.
  • R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman (2019) Video action transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 244–253. Cited by: §2.
  • R. Girdhar and D. Ramanan (2020) CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning. In ICLR, Cited by: §1, §2, §3, §5.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017) The “something something” video database for learning and evaluating visual common sense. In ICCV, pp. 5. Cited by: §1, §3, §5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. Conf. Comput. Vision Pattern Recognition, pp. 770–778. Cited by: §4.1.
  • R. Herzig, A. Bar, H. Xu, G. Chechik, T. Darrell, and A. Globerson (2019a) Learning canonical representations for scene graph to image generation. arXiv preprint arXiv:1912.07414. Cited by: §1, footnote 2.
  • R. Herzig, E. Levi, H. Xu, H. Gao, E. Brosh, X. Wang, A. Globerson, and T. Darrell (2019b) Spatio-temporal action graph networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §2.
  • R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson (2018) Mapping images to scene graphs with permutation-invariant structured prediction. In Advances in Neural Information Processing Systems (NIPS), Cited by: §2.
  • E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §4.1.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. CVPR. Cited by: §4.2.
  • A. Jain, A. R. Zamir, S. Savarese, and A. Saxena (2016) Structural-RNN: deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5308–5317. Cited by: §2.
  • J. Ji, R. Krishna, L. Fei-Fei, and J. C. Niebles (2019) Action genome: actions as composition of spatio-temporal scene graphs. arXiv preprint arXiv:1912.06992. Cited by: §2.
  • J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, Cited by: §4.2.
  • J. Johnson, A. Gupta, and L. Fei-Fei (2018) Image generation from scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1219–1228. Cited by: §2, §4.3, Table 1, §5, footnote 2.
  • J. Johnson, R. Krishna, M. Stark, L. Li, D. Shamma, M. Bernstein, and L. Fei-Fei (2015) Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3668–3678. Cited by: §2.
  • T. Karras, S. Laine, and T. Aila (2019a) A style-based generator architecture for generative adversarial networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2019b) Analyzing and improving the image quality of StyleGAN. CoRR abs/1912.04958. Cited by: §1.
  • D. Kim, S. Woo, J. Lee, and I. S. Kweon (2019) Deep video inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5792–5801. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.
  • T. Kipf, E. Fetaya, K. Wang, M. Welling, and R. Zemel (2018) Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687. Cited by: §2.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2, §2, §4.1.
  • R. Krishna, I. Chami, M. S. Bernstein, and L. Fei-Fei (2018) Referring relationships. ECCV. Cited by: §2.
  • A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2016) Autoencoding beyond pixels using a learned similarity metric. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1558–1566. Cited by: §1.
  • J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell (2020) Something-else: compositional action recognition with spatial-temporal interaction networks. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §3.
  • M. Mathieu, C. Couprie, and Y. LeCun (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440. Cited by: §2.
  • T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §4.1.
  • W. V. O. Quine (1985) Events and reification. Actions and events: Perspectives on the philosophy of Donald Davidson, pp. 162–171. Cited by: §1.
  • M. Raboh, R. Herzig, G. Chechik, J. Berant, and A. Globerson (2020) Differentiable scene graphs. In Winter Conf. on App. of Comput. Vision, Cited by: §2.
  • B. Schroeder, S. Tripathi, and H. Tang (2019) Triplet-aware scene graph embeddings. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: §2.
  • S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning (2015) Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, pp. 70–80. Cited by: §2.
  • A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe (2019) Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2377–2386. Cited by: §2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • Y. Song, J. Zhu, D. Li, X. Wang, and H. Qi (2018) Talking face generation by conditional recurrent adversarial network. arXiv preprint arXiv:1804.04786. Cited by: §2.
  • C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid (2018) Actor-centric relation network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 318–334. Cited by: §2.
  • J. Walker, C. Doersch, A. Gupta, and M. Hebert (2016) An uncertain future: forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pp. 835–851. Cited by: §2.
  • T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018a) Video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2, §4.1, §4.1, §4.2, Table 1, §5.
  • T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018b) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807. Cited by: §1, §1, §4.2.
  • X. Wang and A. Gupta (2018) Videos as space-time region graphs. In ECCV, Cited by: §2.
  • N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu, and A. Tacchetti (2017) Visual interaction networks: learning a physics simulator from video. In Advances in neural information processing systems, pp. 4539–4547. Cited by: §2.
  • N. Xu, A. Liu, J. Liu, W. Nie, and Y. Su (2019) Scene graph captioner: image captioning based on structural visual representation. Journal of Visual Communication and Image Representation, pp. 477–485. Cited by: §2.
  • S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
  • Y. Ye, M. Singh, A. Gupta, and S. Tulsiani (2019) Compositional video prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10353–10362. Cited by: §2, Table 1, §5.
  • K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum (2019) Clevrer: collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442. Cited by: §2.
  • J. M. Zacks and B. Tversky (2001) Event structure in perception and conception.. Psychological bulletin 127 (1), pp. 3. Cited by: §1.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §5.
  • T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros (2016) View synthesis by appearance flow. In ECCV, Cited by: §1, §4.1.

1 Losses and Training

We elaborate on the Flow and Perceptual losses from Section 4.2.

Optical flow loss $\mathcal{L}_{flow}$.

The flow loss includes two terms. The first is the warping loss, which measures the error between the warp of the previous frame $x_{t-1}$ and the ground-truth next frame $x_t$; the second is the error between the flow $\bar{f}_t$ estimated from the GT frames and the predicted flow $f_t$:

$$\mathcal{L}_{flow} = \sum_t \left\| x_t - w(x_{t-1}, f_t) \right\|_1 + \left\| \bar{f}_t - f_t \right\|_1,$$

where $w$ is the warping operation as defined in Section 4.1. This flow loss was proposed previously in [45, 55].
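As an illustration, here is a minimal NumPy sketch of the two loss terms. Function names are ours, and nearest-neighbour warping stands in for the differentiable bilinear warping used in practice:

```python
import numpy as np

def warp(prev_frame, flow):
    # Backward-warp prev_frame by a dense flow field (H, W, 2) using
    # nearest-neighbour sampling; a simplification of bilinear warping.
    h, w = prev_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip((ys + flow[..., 1]).round().astype(int), 0, h - 1)
    src_x = np.clip((xs + flow[..., 0]).round().astype(int), 0, w - 1)
    return prev_frame[src_y, src_x]

def flow_loss(prev_frame, next_frame, pred_flow, gt_flow):
    # L1 warping error plus L1 error against the flow estimated from GT frames.
    warp_term = np.abs(warp(prev_frame, pred_flow) - next_frame).mean()
    flow_term = np.abs(pred_flow - gt_flow).mean()
    return warp_term + flow_term
```

With a zero flow and identical frames both terms vanish, which is the sanity check one would expect from the definition above.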

Perceptual loss $\mathcal{L}_{perc}$.

This is the standard perceptual loss as in pix2pixHD [46]. In particular, we use the VGG network [40] as a feature extractor and minimize the error between the features extracted from the generated and ground-truth images across $K$ layers:

$$\mathcal{L}_{perc} = \sum_{i=1}^{K} \frac{1}{N_i} \left\| \phi^{(i)}(x_t) - \phi^{(i)}(\hat{x}_t) \right\|_1,$$

where $\phi^{(i)}$ denotes the $i$-th layer, with $N_i$ elements, of the VGG network. We sum the above over all frames in the video.
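The layer-normalized L1 comparison can be sketched as follows (the VGG extractor is abstracted away as precomputed per-layer feature lists; names are ours):

```python
import numpy as np

def perceptual_loss(feats_real, feats_fake):
    # feats_real / feats_fake: lists of per-layer feature arrays for one
    # frame pair. Each layer's L1 distance is normalized by its number
    # of elements N_i, then the layers are summed.
    total = 0.0
    for fr, ff in zip(feats_real, feats_fake):
        total += np.abs(fr - ff).sum() / fr.size
    return total
```

In practice the features would come from fixed VGG activations of the ground-truth and generated frames.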

The overall optimization problem is to minimize the weighted sum of the losses:

$$\min_{\Theta} \; \lambda_{flow}\,\mathcal{L}_{flow} + \lambda_{perc}\,\mathcal{L}_{perc} + \lambda_{layout}\,\mathcal{L}_{layout} + \lambda_{D}\,\mathcal{L}_{D},$$

where $\Theta$ are all the trainable parameters of the generative model, $\mathcal{L}_{layout}$ is the Layout loss, and $\mathcal{L}_{D}$ is the pixel action discriminator loss from Section 4.2. In addition to the loss terms in Equation 5, we use a feature matching loss [31, 46] to match the statistics of features extracted by the GAN discriminators.
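The weighted combination itself is straightforward; a sketch (the loss and weight names here are ours, not the paper's code):

```python
def total_generator_loss(losses, weights):
    # Weighted sum of named loss terms, e.g. flow, perceptual,
    # layout and discriminator losses.
    assert set(losses) == set(weights), "each loss needs a weight"
    return sum(weights[k] * losses[k] for k in losses)
```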

2 Graph Convolution Network

As explained in the main paper, we used a Graph Convolution Network (GCN) [29] to predict the layout at each time step $t$. The GCN uses the structure of the action graph and propagates information along this graph (over several iterations) to obtain a set of layout coordinates per object.

Each object category $c$ is assigned a learned embedding $e_c$, and each action $a$ is assigned a learned embedding $e_a$. We next explain how to obtain the layouts using a GCN. Consider the action graph at time $t$ with the corresponding clocked edges $\mathcal{E}_t$. Denote the layout for node $i$ at time $t$ by $l_i^t$. The GCN iteratively calculates a representation for each object and each action in the graph. Let $z_i^k$ be the representation of object $i$ in the $k$-th layer of the GCN. Similarly, for each edge in $\mathcal{E}_t$ given by $(i, a, j)$, let $z_{i,a,j}^k$ be the representation of this edge in the $k$-th layer. These representations are calculated as follows. At the GCN input, we set the representation for node $i$ to be $z_i^0 = [e_{c_i}; l_i^{t-1}]$, and for each edge $(i, a, j)$ we set $z_{i,a,j}^0 = e_a$. All representations at time $t$ are transformed to $d$-dimensional vectors using an MLP. Next, we use three functions (MLPs) $F_s, F_a, F_o$, each from $\mathbb{R}^{3d}$ to $\mathbb{R}^d$. These can be thought of as processing the three vectors on an edge (the subject, action and object representations) and returning three new representations. Given these functions, the updated object representation is the average over all edges incident on $i$ (note that a box can appear both as a “subject” and as an “object”, hence the two different sums in the denominator):

$$z_i^{k+1} = \frac{\sum_{(i,a,j)\in\mathcal{E}_t} F_s\!\left(z_i^k, z_{i,a,j}^k, z_j^k\right) + \sum_{(j,a,i)\in\mathcal{E}_t} F_o\!\left(z_j^k, z_{j,a,i}^k, z_i^k\right)}{\left|\{(i,a,j)\in\mathcal{E}_t\}\right| + \left|\{(j,a,i)\in\mathcal{E}_t\}\right|}.$$

Similarly, the representation for edge $(i, a, j)$ is updated via $z_{i,a,j}^{k+1} = F_a\!\left(z_i^k, z_{i,a,j}^k, z_j^k\right)$.

Finally, we transform the GCN representations above at each time step into a layout as follows. Let $K$ denote the number of GCN updates. The layout for node $i$ is the output of two MLPs applied to $z_i^K$: the first MLP outputs the four bounding box coordinates, and the second outputs the object descriptor. Thus, the $i$-th object in the layout is simply the set of predicted normalized bounding box coordinates together with its descriptor.
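One propagation step of this scheme can be sketched in NumPy as follows (the MLPs are passed in as callables; the dict-based graph encoding and function names are ours, not the paper's code):

```python
import numpy as np

def gcn_step(z_obj, z_edge, edges, F_s, F_a, F_o):
    # z_obj: {node_id: d-vector}, z_edge: {edge_idx: d-vector},
    # edges: list of (subject_id, object_id) parallel to z_edge.
    # Each edge emits subject/object messages; nodes average the
    # messages from all edges incident on them.
    msg = {i: np.zeros_like(v) for i, v in z_obj.items()}
    deg = {i: 0 for i in z_obj}
    new_edge = {}
    for k, (s, o) in enumerate(edges):
        triple = np.concatenate([z_obj[s], z_edge[k], z_obj[o]])
        msg[s] += F_s(triple)
        deg[s] += 1
        msg[o] += F_o(triple)
        deg[o] += 1
        new_edge[k] = F_a(triple)
    # Isolated nodes keep their previous representation.
    new_obj = {i: (msg[i] / deg[i] if deg[i] else z_obj[i]) for i in z_obj}
    return new_obj, new_edge
```

Stacking $K$ such steps and feeding the final node vectors to the two output MLPs would yield the per-object boxes and descriptors described above.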

Figure 6: Qualitative examples for the generation of actions on the CATER dataset. We use the AG2Vid model to generate videos of four standard actions and two composed unseen actions (“Swap” and “Huddle”). The objects involved in actions are highlighted. Click the image to play the video clip in a browser.

3 Actions

For the Something-Something dataset [12], we use the eight most frequent actions: “Putting [something] on a surface”, “Moving [something] up”, “Pushing [something] from left to right”, “Moving [something] down”, “Pushing [something] from right to left”, “Covering [something] with [something]”, “Uncovering [something]”, and “Taking [one of many similar things on the table]”. See Figure 7 for qualitative examples. The box annotations of the objects in the videos are taken from [32].

For the CATER dataset we include the four actions as provided by [10]. These include: “Rotate”, “Cover”, “Pick Place” and “Slide”. See Figure 6 for qualitative examples.

Figure 7: Qualitative examples for the generation of actions on the Something Something dataset. We use our AG2Vid model to generate videos of eight standard actions and two composed unseen actions (“Right Up” and “Down Left”). Click the image to play the video clip in a browser.

4 Experiments and Results

4.1 Human Evaluation of Action Timing in Generated Videos

As described in Section 5.1, we evaluated the extent to which action graphs (AGs) can control the execution timing of actions on the CATER dataset. We generated 90 pairs of action graphs where the only difference within each pair is the timing of one action. We then asked annotators to select the video in which the action is executed first. The full results are reported in Table 4, and visual examples are shown in Figure 8. The results for all actions except “Rotate” are consistent with the expected behavior, indicating that the model executes actions at the correct times. The “Rotate” action is especially challenging to generate since it occurs within the intermediate layout, and it is also easy to miss, as it involves a relatively subtle change in the video.

Methods Standard Actions Composed Actions
Slide Contain Pick Place Rotate Swap Huddle
AG2Vid (Ours) 96.7 100.0 90.0 56.7 93.3 100.0
Table 4: Human evaluation of timing in generated videos (see Section 4.1). The table reports accuracy of human annotator answer with respect to the true answer.

Figure 8: Timing experiment examples in CATER. We show that the clock edges can manipulate the timing of a video by controlling when each action is performed, enabling goal-oriented video synthesis. The objects involved in “rotate” are highlighted. Click the image to play the video clip in a browser.

4.2 Human Evaluation of Semantic Quality in Generated Videos

To test the degree to which the generated videos match their corresponding actions, we generated twenty videos per action for the Something-Something dataset and asked three different human annotators to evaluate each video. Each annotator was asked to pick the action that best describes the video out of the list of possible actions. We provide the results in Table 5; each cell corresponds to the class recall of a specific action. To determine whether a video correctly matches its corresponding action, we used majority voting over the answers of all annotators.
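The majority-vote scoring described above can be sketched as follows (a hypothetical helper, not the paper's evaluation code):

```python
from collections import Counter

def majority_label(choices):
    # Most common action label among the annotators' answers.
    return Counter(choices).most_common(1)[0][0]

def class_recall(samples):
    # samples: list of (true_action, [per-annotator choices]).
    # Returns the fraction of videos per action whose majority-voted
    # label matches the true action.
    correct, total = Counter(), Counter()
    for action, choices in samples:
        total[action] += 1
        correct[action] += int(majority_label(choices) == action)
    return {a: correct[a] / total[a] for a in total}
```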

It turns out that humans do not perform perfectly on this task. We quantified this effect in the following experiment on the Something-Something dataset: we applied the same annotation process to ground-truth videos (see the “Real” row in Table 5). Interestingly, the reported accuracy in Table 5 shows that our generated action videos for “Down” and “Take” are more easily recognized by humans than the ground-truth videos.

For the CATER dataset, we did not perform such human evaluation of predicted actions, since CATER videos contain multiple activities.


Video Source Standard Actions
Right Up Down Left Put Take Uncover Cover
Generated 100.0 50.0 100.0 75.0 95.0 80.0 25.0 55.0
Real 100.0 100.0 90.0 100.0 100.0 65.0 100.0 85.0
Table 5: Human evaluation of the semantic quality of generated and real action videos. For each video synthesized for a given action, we asked raters to select the action depicted in the video. The table reports the accuracy of the human annotators with respect to the true action underlying the video. The column headers correspond to: ’Pushing [something] from left to right’, ’Moving [something] up’, ’Moving [something] down’, ’Pushing [something] from right to left’, ’Putting [something] on a surface’, ’Taking [one of many similar things on the table]’, ’Uncovering [something]’, ’Covering [something] with [something]’.

4.3 Comparing AG2Vid to Scene-Graph Based Generation

Scene graphs are an expressive formalism for describing image content, and both datasets we use have frame-level scene graph annotations. We therefore compare generation from scene graphs with generation from action graphs. Towards this end, we trained a scene-graph-to-image model [22] to generate the frames of the videos from their corresponding scene graphs. This model does not condition on the action or the initial frame and serves only as a point of comparison for realistic generation. As Figure 9 shows, the videos generated by AG2Vid are more temporally coherent than the frame sequences generated from scene graphs.

Figure 9: Comparing Sg2Im and Ag2Vid results in CATER. Each column is a different sample. Click the image to play the video clip in a browser.