Zero-Shot Generation of Human-Object Interaction Videos

by Megha Nawhal et al.

Generation of videos of complex scenes is an important open problem in computer vision research. Human activity videos are a good example of such complex scenes. Human activities are typically formed as compositions of actions applied to objects; modeling interactions between people and the physical world is a core part of visual understanding. In this paper, we introduce the task of generating human-object interaction videos in a zero-shot compositional setting, i.e., generating videos for action-object compositions that are unseen during training, having seen the target action and target object independently. To generate human-object interaction videos, we propose a novel adversarial framework, HOI-GAN, which includes multiple discriminators focusing on different aspects of a video. To demonstrate the effectiveness of our proposed framework, we perform extensive quantitative and qualitative evaluation on two challenging datasets: EPIC-Kitchens and 20BN-Something-Something v2.





1 Introduction

Visual imagination and prediction are fundamental components of human intelligence. Arguably, the ability to create realistic renderings from symbolic representations is considered a prerequisite for broad visual understanding. Motivated by these factors, computer vision research has seen rapid advances in image generation over the last few years. Existing models are capable of generating impressive results in this static setting, ranging from hand-written digits [goodfellow2014generative, denton2015deep, arjovsky2017wasserstein] to realistic scenes [van2016conditional, zhang2017stackgan, karras2017progressive, brock2018large, isola2017image]. Progress on video generation [vondrick2016generating, saito2017temporal, tulyakov2018mocogan, he2018probabilistic, wang2018vid2vid, bansal2018recycle, wang2019fewshotvid2vid], on the other hand, has been relatively moderate, and the problem remains open and challenging. In addition, while most approaches focus on the expressivity and controllability of the underlying generative models, their ability to generalize to unseen scene compositions has not received as much attention. Such generalizability, however, is an important cornerstone of robust visual imagination, as it demonstrates the capacity to reason over the elements of a scene.

Figure 1: Zero-Shot Generation of Human-Object Interactions. Given the action sequences “wash aubergine” and “put tomato”, a visually intelligent agent should be able to imagine the action sequences for unseen action-object compositions, i.e., “wash tomato” and “put aubergine”.

We posit that the domain of human activities constitutes a rich, realistic testbed for video generation models. Human activities involve people interacting with objects in complex ways, presenting numerous challenges for generation: the need to (1) render a variety of objects; (2) model the temporal evolution of the effect of actions on objects; (3) understand spatial relations and interactions; and (4) overcome the paucity of data for an exponential set of action-object pairings. The last, in particular, is a critical challenge that also serves as an opportunity for designing and evaluating generative models that can generalize to myriad, possibly unseen, action-object compositions. For example, consider Figure 1. The activity sequences for “wash aubergine” (action: “wash”; object: “aubergine”) and “put tomato” (action: “put”; object: “tomato”) are observed in the training data. A robust visual imagination would then allow an agent to imagine the videos for “wash tomato” and “put aubergine”.

Consolidating the challenge of developing generalizable generative models with the complexity of human activity video generation, we propose a novel framework for generating human-object interaction (HOI) videos given unseen action-object compositions. We refer to this task as zero-shot HOI video generation. To the best of our knowledge, our work is the first to propose and address this problem. In doing so, we push the envelope on conditional (or controllable) video generation and focus squarely on the model’s ability to generalize in a zero-shot compositional setting. This setting verifies that the model is capable of semantically disentangling the action and the object in a given context and of recreating each separately in other contexts.

The desiderata for zero-shot HOI video generation include: (1) mapping the content in the video to the right semantic category; (2) ensuring spatial and temporal consistency across the frames of a video; and (3) producing interactions with the right object in the presence of multiple objects. Based on these observations, we introduce a novel multi-adversarial learning scheme involving multiple discriminators, each focusing on a different aspect of an HOI video. Our framework, HOI-GAN, generates a fixed-length video clip given an action, an object, and a target scene serving as the context. During training of the generator, our framework utilizes four discriminators: three pixel-centric discriminators, namely a frame discriminator, a gradient discriminator, and a video discriminator; and one object-centric relational discriminator. The three pixel-centric discriminators ensure spatial and temporal consistency across the frames. The novel relational discriminator leverages a spatio-temporal scene graph to reason over the object layouts in videos, ensuring the right interactions among objects. Through experiments, we show that the HOI-GAN framework is able to disentangle objects and actions and learns to generate videos with unseen compositions.

In summary, our contributions are as follows:

  • We introduce the task of zero-shot HOI video generation. Specifically, given a set of videos depicting certain action-object compositions, we propose to generate unseen compositions having seen the target action and target object individually, i.e., the target action was paired with a different object and the target object was involved in a different action.

  • We propose a novel adversarial learning scheme and introduce the HOI-GAN framework to generate HOI videos in the zero-shot compositional setting.

  • We demonstrate the effectiveness of HOI-GAN through empirical evaluation on two challenging HOI video datasets: 20BN-Something-Something v2 [goyal2017something] and EPIC-Kitchens [Damen2018EPICKITCHENS]. We perform both quantitative and qualitative evaluation of the proposed approach and compare it with state-of-the-art approaches.

Overall, our work is valuable in facilitating research in the direction of enhancing generalizability of generative models for realistic videos.

2 Related Work

Our paper builds on the prior work in: (1) modeling of human-object interactions and (2) GAN-based video generation. In addition, we also discuss other literature relevant to HOI video generation in zero-shot compositional setting.

Modeling Human-Object Interactions. The study of human-object interactions (HOIs) has a rich history in the field of computer vision. Earlier research attempts aimed at studying object affordances [grabner2011makes, kjellstrom2011visual] and semantic-driven understanding of object functionalities [stark1991achieving, gupta2007objects]. Recent work on modeling HOIs in images range from studying semantics and spatial features of interactions between humans and objects [delaitre2012scene, zellers2018neural, gkioxari2018detecting] to action information[fouhey2014people, desai2012detecting, yao2010modeling]. Furthermore, there have been attempts to create large scale image and video datasets to study HOI [krishna2017visual, chao2015hico, chao2018learning, goyal2017something]. To model dynamics in HOIs, recent works have proposed methods that jointly model actions and objects in videos [kato2018compositional, sigurdsson2017actions, kalogeiton2017joint]. Inspired by these approaches, we model an HOI video as a composition of an action and an object.

GAN-based Image/Video Generation. Generative Adversarial Network (GAN) [goodfellow2014generative] and its variants [denton2015deep, arjovsky2017wasserstein, zhao2017energy] have shown tremendous progress in high quality image generation. Built over these techniques, conditional image generation using various forms of inputs to the generator such as textual information [reed2016generative, zhang2017stackgan, xu2018attngan], category labels [odena2017conditional, miyato2018cgans], and images [kim2017learning, isola2017image, zhu2017unpaired, liu2017unsupervised] have been widely studied. This class of GANs allows the generator network to learn a mapping between conditioning variables and the real data distribution, thereby allowing control over the generation process.

Extending these efforts to conditional video generation is not straightforward, as generating a video involves modeling both spatial and temporal variations. Vondrick et al. [vondrick2016generating] proposed the Video GAN (VGAN) framework to generate videos using a two-stream generator network that decouples the foreground and background of a scene. Temporal GAN (TGAN) [saito2017temporal] employs a separate generator for each frame in a video and an additional generator to model temporal variations across these frames. VGAN and TGAN have primarily been proposed for unconditional generation. MoCoGAN [tulyakov2018mocogan] disentangles the latent space representations of motion and content in a video to perform controllable video generation using seen compositions of motion and content as conditional inputs. In this paper, we evaluate the extent to which these video generation methods generalize when provided with unseen scene compositions as conditioning variables. Furthermore, promising success has been achieved by recent video-to-video translation methods [wang2018vid2vid, wang2019fewshotvid2vid, bansal2018recycle], wherein video generation is conditioned on a corresponding semantic video. In contrast, our task does not require semantic videos as conditional input.

Video Prediction/Completion. Video prediction approaches predict future frames of a video given one or a few observed frames, using recurrent networks [srivastava2015unsupervised], variational auto-encoders [walker2016uncertain, walker2017pose], adversarial training [mathieu2016deep, liang2017dual], or auto-regressive methods [kalchbrenner2017video]. While video prediction is typically posed as generating future frames conditioned on observed past frames, it is substantially different from video generation, where the goal is to generate a video clip from a stochastic latent space.

Video completion/inpainting refers to the problem of correctly filling up the missing pixels given a video with arbitrary spatio-temporal pixels missing [newson2014video, shen2006video, granados2012background, ebdelli2015video, niklaus2017video]. In contrast, our task goes beyond inpainting of pixels and performs reasoning across various data samples in the dataset to procure the right visual content in a video.

Zero-Shot Learning. Zero-shot learning (ZSL) aims to solve the problem of recognizing classes whose instances may not have been seen during training. In ZSL, external information of a certain form is required to share information between classes to transfer knowledge from seen to unseen classes. A variety of techniques have been used for ZSL ranging from usage of attribute-based information [lampert2009learning, farhadi2009describing], word embeddings [xian2018feature] to WordNet hierarchy [akata2015evaluation] and text-based descriptions [guadarrama2013youtube2text, elhoseiny2013write, zhu2018generative, lei2015predicting]. [xian2017zero] provides a thorough overview of zero-shot learning techniques. Similar to these works, we leverage word embeddings to reason over the unseen compositions of actions and objects and focus on enhancing the generalization capabilities in GANs.

Learning Visual Relationships. Visual relationships in the form of scene graphs, i.e., directed graphs representing relationships (edges) between objects (nodes), have been used for image caption evaluation [anderson2016spice], image retrieval [johnson2015image], and predicting scene compositions for images [xu2017scene, lu2016visual, newell2017pixels]. Additionally, spatio-temporal graphs have been used to learn representations of complex human activity videos for discriminative tasks [wang2018videos, tsai2019GSTEG]. Our model leverages spatio-temporal scene graphs to ensure the generator learns relevant relations among the objects in a video. Furthermore, in the generative setting, [johnson2018image] synthesizes an image from a given scene graph and evaluates the generalizability of an adversarial network to create images with unseen relationships between objects. Similarly, our proposed task of zero-shot HOI video generation focuses on generalizing generative models to unseen compositions of actions and objects; however, this task is more difficult as it requires learning the mapping of inputs to the spatial as well as temporal variations in a video.

Learning Disentangled Representations for Videos. Various methods have been proposed to learn disentangled representations in videos [tulyakov2018mocogan, hsieh2018learning, denton2017unsupervised], such as decoupling content and pose [denton2017unsupervised] or separating motion from content using image differences [villegas2017decomposing]. Similarly, our model implicitly learns to disentangle the action and object information of an HOI video.

3 HOI-GAN

Intuitively, for a generated human-object interaction (HOI) video to be realistic, it must: (1) contain the object designated by a semantic label; (2) exhibit the prescribed interaction with that object; (3) be temporally consistent; and (4, optionally) occur in a specified scene. Based on this intuition, we propose an adversarial learning scheme in which we train a generator network with a set of four discriminators: (1) a frame discriminator, which encourages the generator to learn spatially coherent visual content; (2) a gradient discriminator, which incentivizes the generator to produce temporally consistent frames; (3) a video discriminator, which provides the generator with global spatio-temporal context; and (4) a relational discriminator, which assists the generator in producing the right object layouts in a video. We use pretrained word embeddings [pennington2014glove] as semantic representations of actions and objects. All discriminators are conditioned on the word embeddings of the action and the object, and all are trained simultaneously in an end-to-end manner. An overview of our proposed framework, HOI-GAN, is shown in Figure 2. We now formalize the task and describe each module in more detail.

3.1 Task Formulation

Let s_a and s_o be word embeddings of an action a and an object o, respectively. Furthermore, let c be an image provided as context to the generator. We encode c using an encoder to obtain a visual embedding h_c, which we refer to as the context vector. Our goal is to generate a video V of length T depicting the action a performed on the object o, with the context image c as the background of V. To this end, we learn a function G : (s_a, s_o, h_c, z) → V, where z is a noise vector sampled from a distribution p_z, such as a Gaussian distribution.
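The interface of this mapping can be sketched as follows. This is a minimal illustration only: the 300-dimensional embeddings and length-100 noise vector follow the implementation details later in the paper, while the context-vector dimension, frame count, and resolution are made-up placeholders (the real G is the convolutional network of Section 3.2).

```python
import numpy as np

def generate_video(s_a, s_o, h_c, z, T=16, H=64, W=64):
    """Toy stand-in for the generator G: maps the action embedding s_a,
    object embedding s_o, context vector h_c, and noise z to a video of
    T RGB frames. Here we only illustrate the interface by returning a
    random tensor of the right shape."""
    cond = np.concatenate([s_a, s_o, h_c, z])  # conditioning vector fed to G
    assert cond.ndim == 1
    rng = np.random.default_rng(0)
    return rng.standard_normal((T, 3, H, W))   # (time, channels, height, width)

s_a = np.ones(300)   # GloVe-sized action embedding
s_o = np.ones(300)   # GloVe-sized object embedding
h_c = np.ones(128)   # context vector (dimension assumed)
z   = np.ones(100)   # noise vector of length 100
video = generate_video(s_a, s_o, h_c, z)
print(video.shape)   # (16, 3, 64, 64)
```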

Figure 2: Architecture Overview. The generator network is trained using four discriminators simultaneously: a frame discriminator, a gradient discriminator, a video discriminator, and a relational discriminator. Given the word embeddings of an action and an object, and a context image, the generator learns to synthesize a video with the context image as background in which the given action is performed on the given object.

3.2 Model Description

We describe each element of our framework below. Overall, the four discriminator networks, i.e., the frame discriminator D_f, gradient discriminator D_g, video discriminator D_v, and relational discriminator D_r, are all involved in a zero-sum game with the generator network G.

Frame Discriminator. The frame discriminator network D_f learns to distinguish between real and generated frames, corresponding to the real video V_r and the generated video V_g respectively. Each frame in V_r and V_g is processed independently using a network of stacked conv2d layers, i.e., 2D convolutional layers, each followed by spectral normalization [miyato2018spectral] and a leaky ReLU activation [maas2013rectifier]. We concatenate the activation tensor of the last conv2d layer with spatially replicated copies of the action and object embeddings, apply another conv2d layer, and then apply further convolutions and a sigmoid to obtain a T-dimensional vector corresponding to the T frames of the video. The i-th element D_f^{(i)}(V) of the output denotes the probability that the i-th frame of the video V is real. The objective function of D_f is the loss

  \mathcal{L}_f = \frac{1}{T}\sum_{i=1}^{T} \Big[\log D_f^{(i)}(V_r) + \log\big(1 - D_f^{(i)}(V_g)\big)\Big],

which D_f maximizes; the conditioning on the action and object embeddings is left implicit.
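Numerically, this per-frame objective is a binary cross-entropy over the discriminator's per-frame probabilities. A minimal sketch (generic BCE, not the authors' code):

```python
import numpy as np

def frame_discriminator_loss(p_real, p_fake):
    """Per-frame adversarial loss for the frame discriminator.
    p_real[i]: predicted probability that frame i of the real video is real.
    p_fake[i]: predicted probability that frame i of the generated video is real.
    The discriminator maximizes log p_real + log(1 - p_fake) averaged over
    frames; we return the negated value as a loss to minimize."""
    p_real = np.clip(p_real, 1e-7, 1 - 1e-7)
    p_fake = np.clip(p_fake, 1e-7, 1 - 1e-7)
    return -np.mean(np.log(p_real) + np.log(1.0 - p_fake))

# A confident discriminator (real ~1, fake ~0) incurs a smaller loss
# than an undecided one (both ~0.5):
good = frame_discriminator_loss(np.full(16, 0.99), np.full(16, 0.01))
bad  = frame_discriminator_loss(np.full(16, 0.5),  np.full(16, 0.5))
print(good < bad)  # True
```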

Gradient Discriminator. The gradient discriminator network D_g enforces temporal smoothness by learning to differentiate between the temporal gradient of a real video V_r and that of a generated video V_g. We define the temporal gradient ∇V of a video V with T frames as the pixel-wise differences between consecutive frames:

  [\nabla V]_i = V_{i+1} - V_i, \qquad i = 1, \ldots, T-1.

The architecture of the gradient discriminator D_g is similar to that of the frame discriminator. The output of D_g is a (T−1)-dimensional vector corresponding to the elements of the gradient, where the i-th element D_g^{(i)}(\nabla V) denotes the probability that the i-th gradient comes from a real video. The objective function of D_g is

  \mathcal{L}_g = \frac{1}{T-1}\sum_{i=1}^{T-1} \Big[\log D_g^{(i)}(\nabla V_r) + \log\big(1 - D_g^{(i)}(\nabla V_g)\big)\Big].
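The temporal gradient itself is just a frame-difference operator; a minimal sketch:

```python
import numpy as np

def temporal_gradient(video):
    """Pixel-wise differences between consecutive frames.
    video: array of shape (T, H, W, C) -> gradient of shape (T-1, H, W, C)."""
    return video[1:] - video[:-1]

# A static video has an all-zero temporal gradient:
static = np.ones((8, 4, 4, 3))
print(np.allclose(temporal_gradient(static), 0))  # True

# A video whose brightness grows by 0.1 per frame has a constant gradient of 0.1:
ramp = np.stack([np.full((4, 4, 3), 0.1 * t) for t in range(8)])
print(np.allclose(temporal_gradient(ramp), 0.1))  # True
```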

Video Discriminator. The video discriminator network D_v learns to distinguish between real videos V_r and generated videos V_g by comparing their global spatio-temporal contexts. The architecture consists of stacked conv3d layers, i.e., 3D convolutional layers, each followed by spectral normalization [miyato2018spectral] and a leaky ReLU activation [maas2013rectifier]. We concatenate the activation tensor of the last conv3d layer with spatially replicated copies of the action and object embeddings, apply another conv3d layer, and finally apply further convolutions and a sigmoid to obtain a scalar output D_v(V) representing the probability that the video V is real. The objective function of D_v is the loss

  \mathcal{L}_v = \log D_v(V_r) + \log\big(1 - D_v(V_g)\big).
Relational Discriminator.

Figure 3: Relational Discriminator. The relational discriminator leverages a spatio-temporal scene graph to distinguish between object layouts in videos. Each node contains a convolutional embedding and the position and aspect ratio (AR) of an object crop obtained from MaskRCNN. Nodes are connected in space and time, and edges are weighted by inverse distance. Edge weights of (dis)appearing objects are set to 0.

In addition to the three pixel-centric discriminators above, we also propose a novel object-centric relational discriminator D_r. Driven by a spatio-temporal scene graph, this relational discriminator learns to distinguish between the object layouts of real videos and generated videos (see Figure 3).

Specifically, we build a spatio-temporal scene graph Γ = (N, E) from a video, where N and E denote the nodes and edges respectively. We assume one node per object per frame. Each node is connected to all other nodes in the same frame via spatial edges. In addition, to represent the temporal evolution of objects, each node is connected via temporal edges to the corresponding nodes in adjacent frames that depict the same object. To obtain the node representations, we crop the objects using MaskRCNN [he2017mask], compute a convolutional embedding for each crop, and augment the resulting vectors with the aspect ratio and position of the corresponding bounding boxes. The weights of the spatial edges in E are given by the inverse Euclidean distances between the centers of these bounding boxes, and the weights of the temporal edges are set to 1 by default. The cases of (dis)appearing objects are handled by setting the corresponding spatial and temporal edge weights to 0.

The relational discriminator D_r operates on this scene graph by virtue of a graph convolutional network (GCN) [kipf2017semi], followed by stacking and average-pooling of the resulting node representations along the time axis. We then concatenate this tensor with spatially replicated copies of the action and object embeddings. As before, we then apply convolutions and a sigmoid to obtain the final output D_r(Γ), which denotes the probability that the scene graph Γ belongs to a real video. The objective function of D_r is

  \mathcal{L}_r = \log D_r(\Gamma_r) + \log\big(1 - D_r(\Gamma_g)\big),

where Γ_r and Γ_g are the scene graphs of the real and generated videos respectively.
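The inverse-distance spatial edge weighting can be sketched as follows (toy boxes and a hypothetical helper name; the real pipeline uses MaskRCNN detections and convolutional node features):

```python
import numpy as np

def spatial_edge_weights(boxes):
    """Spatial edge weights for one frame of the scene graph.
    boxes: list of (x1, y1, x2, y2) object bounding boxes.
    The weight between two nodes is the inverse Euclidean distance
    between their box centers; absent objects would get weight 0."""
    centers = np.array([[(x1 + x2) / 2, (y1 + y2) / 2] for x1, y1, x2, y2 in boxes])
    n = len(boxes)
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                d = np.linalg.norm(centers[i] - centers[j])
                w[i, j] = 1.0 / d if d > 0 else 0.0
    return w

# Two boxes whose centers are 5 pixels apart get edge weight 1/5:
w = spatial_edge_weights([(0, 0, 2, 2), (3, 4, 5, 6)])
print(w[0, 1])  # 0.2
```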
Generator. Given the semantic embeddings s_a and s_o of the action and object labels respectively, and the context vector h_c, the generator network G learns to generate a video V_g consisting of T RGB frames of fixed height and width. We concatenate a noise vector z with the conditioning variables s_a, s_o, and h_c, and provide the concatenated vector as input to G. The network comprises stacked deconv3d layers, i.e., 3D transposed convolution layers, each followed by Batch Normalization [ioffe15bn] and a leaky ReLU activation [maas2013rectifier], except the last layer, which is followed by Batch Normalization [ioffe15bn] and a tanh activation. The network is optimized to fool all four discriminators, minimizing the objective

  \mathcal{L}_G = \frac{1}{T}\sum_{i=1}^{T}\log\big(1 - D_f^{(i)}(V_g)\big) + \frac{1}{T-1}\sum_{i=1}^{T-1}\log\big(1 - D_g^{(i)}(\nabla V_g)\big) + \log\big(1 - D_v(V_g)\big) + \log\big(1 - D_r(\Gamma_g)\big),

where ∇V_g is the temporal gradient of the generated video and Γ_g its scene graph.
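A numerical sketch of this combined objective, assuming equal weighting of the four adversarial terms (the paper's exact weighting is not reproduced here):

```python
import numpy as np

def generator_loss(pf_fake, pg_fake, pv_fake, pr_fake):
    """Combined adversarial loss for the generator: it is penalized whenever
    a discriminator confidently labels its output as fake.
    pf_fake: per-frame probabilities from the frame discriminator (length T)
    pg_fake: per-gradient probabilities from the gradient discriminator (length T-1)
    pv_fake: scalar probability from the video discriminator
    pr_fake: scalar probability from the relational discriminator
    All values are 'probability that the generated sample is real'."""
    eps = 1e-7
    terms = [
        np.mean(np.log(1.0 - np.clip(pf_fake, eps, 1 - eps))),
        np.mean(np.log(1.0 - np.clip(pg_fake, eps, 1 - eps))),
        np.log(1.0 - np.clip(pv_fake, eps, 1 - eps)),
        np.log(1.0 - np.clip(pr_fake, eps, 1 - eps)),
    ]
    return float(np.sum(terms))  # the generator minimizes this sum

# Fooling the discriminators (probabilities near 1) drives the loss down:
fooled = generator_loss(np.full(16, 0.9), np.full(15, 0.9), 0.9, 0.9)
caught = generator_loss(np.full(16, 0.1), np.full(15, 0.1), 0.1, 0.1)
print(fooled < caught)  # True
```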
Implementation Details. In our experiments, the convolutional layers in all networks have kernel size 4 and stride 2. We generate fixed-length video clips of T frames. The noise vector z is of length 100. To obtain the semantic embeddings s_a and s_o corresponding to the action and object labels respectively, we use Wikipedia-pretrained GloVe [pennington2014glove] embedding vectors of length 300. We provide further implementation details of the model architecture in the supplementary section. For training, we use the Adam [kingma2015adam] optimizer with a learning rate of 0.0002, and we train all our models with a batch size of 32. We use dropout (probability 0.3) [salimans2016improved] in the last layer of all discriminators and in all layers (except the first) of the generator.

4 Experiments

We conduct extensive quantitative and qualitative experimentation to demonstrate the effectiveness of the proposed framework HOI-GAN for the task of zero-shot generation of human-object interaction (HOI) videos.

Figure 4: Qualitative Results. Videos generated using our best version of HOI-GAN, given the embeddings for an action-object composition and the context frame. Rows show the compositions “take spoon” (EPIC), “hold cup” (SS), “move broccoli” (EPIC), and “put apple” (SS); for each, we show 6 frames of the clip generated under both generation scenarios GS1 and GS2. Best viewed in color on desktop. Refer to the supplementary section for additional videos generated using HOI-GAN.

4.1 Datasets and Data Splits

We use two datasets for our experiments: EPIC-Kitchens [Damen2018EPICKITCHENS] and 20BN-Something-Something V2 [goyal2017something]. Both of these datasets comprise a diverse set of HOI videos ranging from simple translational motion of objects (e.g. push, move) and rotation (e.g. open) to transformations in state of objects (e.g. cut, fold). Therefore, these datasets, with their wide ranging variety and complexity, provide a challenging setup for evaluating HOI video generation models.

EPIC-Kitchens [Damen2018EPICKITCHENS] contains egocentric videos of activities in several kitchens. A video clip is annotated with an action label and an object label (e.g. open microwave, cut apple, move pan), along with a set of bounding boxes (one per frame) for the objects the human interacts with while performing the action. There are around 40k such annotated clip instances across 352 objects and 125 actions. We refer to this dataset as EPIC hereafter.

20BN-Something-Something V2 [goyal2017something] contains videos of daily activities performed by humans. A video clip is annotated with a label consisting of an action template and the object(s) on which the action is applied (e.g. ‘hitting ball with racket’ has the action template ‘hitting something with something’). There are 220,847 training instances spanning 30,408 objects and 174 action templates. To transform a label into an action-object pair, we use the NLTK POS-tagger. We take the verb tag (after stemming) in the label as the action. We observe that all labels begin with the present continuous form of the verb, which acts upon the subsequent noun; therefore, we use the noun that appears immediately after the verb as the object. Hereafter, we refer to the transformed dataset as SS.
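This transformation can be sketched with a simplified rule in place of the NLTK POS-tagger and stemmer the paper uses; the "-ing" stripping below is a crude stand-in for a real stemmer and only illustrates the verb-then-noun structure of the labels:

```python
def split_label(label):
    """Toy label transformation for 20BN-SS: the first word of every
    template is the present-continuous verb, and the word right after it
    is the object. Crude stemming: strip '-ing' and collapse a doubled
    final consonant ('hitting' -> 'hitt' -> 'hit')."""
    words = label.lower().split()
    verb = words[0]
    action = verb[:-3] if verb.endswith("ing") else verb
    if len(action) > 2 and action[-1] == action[-2]:
        action = action[:-1]
    return action, words[1]

print(split_label("hitting ball with racket"))  # ('hit', 'ball')
print(split_label("holding cup"))               # ('hold', 'cup')
```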

Splitting by Compositions. To make the dataset splits suitable for the zero-shot compositional setting, we first merge the data samples present in the default train and validation splits of each dataset. We then split the combined data into a train split and a test split such that every unique object and action label appears independently in the train split, while any action-object composition present in the test split is absent from the train split and vice versa. We provide the details of the splits for both EPIC and SS in the supplementary section.
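A minimal sketch of such a compositional split, using the "wash/put" example from Figure 1 (the helper `zero_shot_split` is illustrative, not the authors' splitting code):

```python
def zero_shot_split(samples, held_out):
    """Split (action, object) samples so that held-out compositions never
    appear in train, while every individual action and object still does.
    samples: list of (action, object) pairs; held_out: set of pairs.
    Returns (train, test); raises if the zero-shot condition is violated."""
    train = [s for s in samples if s not in held_out]
    test = [s for s in samples if s in held_out]
    train_actions = {a for a, _ in train}
    train_objects = {o for _, o in train}
    for a, o in test:
        if a not in train_actions or o not in train_objects:
            raise ValueError(f"composition ({a}, {o}) is not zero-shot recoverable")
    return train, test

samples = [("wash", "aubergine"), ("put", "tomato"),
           ("wash", "tomato"), ("put", "aubergine")]
train, test = zero_shot_split(samples,
                              held_out={("wash", "tomato"), ("put", "aubergine")})
print(train)  # [('wash', 'aubergine'), ('put', 'tomato')]
print(test)   # [('wash', 'tomato'), ('put', 'aubergine')]
```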

Generation Scenarios. Recall that the generator network in the HOI-GAN framework (Figure 2) has three conditional inputs, namely the action embedding, the object embedding, and a context frame that serves as the background of the scene. To provide this context frame during training, we apply a binary mask m to the first frame f_1 of a real video as c = (J − m) ⊙ f_1, where J is an all-ones matrix of the same size as m and ⊙ denotes elementwise multiplication. The mask m contains ones in the regions (either rectangular bounding boxes or segmentation masks) corresponding to the objects (non-person classes) detected using MaskRCNN [he2017mask] and zeros elsewhere. Intuitively, this helps ensure the generator learns to map the action and object embeddings to the relevant visual content in the HOI video.
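The masking step can be sketched as follows (toy frame and mask; in the real pipeline the mask comes from MaskRCNN detections):

```python
import numpy as np

def context_frame(first_frame, object_mask):
    """Background context for the generator: blank out detected object
    regions from the first frame, c = (J - m) * f1, where J is all ones
    and m is 1 inside object regions."""
    J = np.ones_like(object_mask)
    return (J - object_mask)[..., None] * first_frame  # broadcast over channels

frame = np.full((4, 4, 3), 0.5)   # toy 4x4 RGB frame
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0              # a detected object occupies the center
c = context_frame(frame, mask)
print(c[0, 0])  # [0.5 0.5 0.5]  background kept
print(c[1, 1])  # [0. 0. 0.]     object region zeroed out
```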

During testing, to focus the evaluation on the generator’s capability to synthesize the right human-object interactions, we provide a background frame as described above. This implies that the suitability of the background for the target composition may vary. To reflect this in our evaluation, we design two different generation scenarios, befitting either the target action-object composition or just the target action. Specifically, in Generation Scenario 1 (GS1), the input context frame is the masked first frame of the video from the test split corresponding to the target action and object (held out during training). In Generation Scenario 2 (GS2), the input context frame is the masked first frame of a video from the train split depicting the target action but a different object. As such, in GS1, the generator receives a background that may not have been seen during training (harder) but has strong semantic consistency with the corresponding action-object composition it is being asked to generate (easier). In contrast, in GS2, the generator receives a background that has been seen during training (easier) but may not be consistent with the action-object composition it is being asked to generate (harder). Furthermore, these generation scenarios help illustrate that the generator is not memorizing the background and indeed generalizes over scenes.

Method                            |  I↑   S↓   D↑  |  I↑   S↓   D↑  |  I↑   S↓   D↑  |  I↑   S↓   D↑
C-VGAN [vondrick2016generating]   |  1.8  30.9  0.2 |  1.4  44.9  0.3 |  2.1  25.4  0.4 |  1.8  40.5  0.3
C-TGAN [saito2017temporal]        |  2.0  30.4  0.6 |  1.5  35.9  0.4 |  2.2  28.9  0.6 |  1.6  39.7  0.5
MoCoGAN [tulyakov2018mocogan]     |  2.4  30.7  0.5 |  2.2  31.4  1.2 |  2.8  17.5  1.0 |  2.4  33.7  1.4
HOI-GAN (bboxes)                  |  6.0  14.0  3.4 |  5.7  20.8  4.0 |  6.6  12.7  3.5 |  6.0  15.2  2.9
HOI-GAN (masks)                   |  6.2  13.2  3.7 |  5.2  18.3  3.5 |  8.6  11.4  4.4 |  7.1  14.7  4.0
Table 1: Quantitative Evaluation. Comparison of HOI-GAN with C-VGAN, C-TGAN, and MoCoGAN baselines. We distinguish training of HOI-GAN with bounding boxes (bboxes) and segmentation masks (masks). Arrows indicate whether lower (↓) or higher (↑) is better. [I: inception score; S: saliency score; D: diversity score]
 I↑   S↓   D↑  |  I↑   S↓   D↑  |  I↑   S↓   D↑  |  I↑   S↓   D↑
1.4  44.2  0.2 |  1.1  47.2  0.3 |  1.8  34.7  0.4 |  1.5  39.5  0.3
2.3  25.6  0.7 |  1.9  30.7  0.5 |  3.0  24.5  0.9 |  2.7  28.8  0.7
2.8  21.2  1.3 |  2.6  29.7  1.7 |  3.3  18.6  1.2 |  3.0  20.7  1.0

2.4  24.9  0.8 |  2.2  26.0  0.7 |  3.1  20.3  1.0 |  2.9  27.7  0.9
5.9  15.4  3.5 |  4.8  21.3  3.3 |  7.4  12.1  3.5 |  5.4  19.2  3.4
6.2  13.2  3.7 |  5.2  18.3  3.5 |  8.6  11.4  4.4 |  7.1  14.7  4.0
Table 2: Ablation Study. We evaluate the contributions of our pixel-centric losses (F, G, V) and relational loss (first block vs. second block) by conducting an ablation study on HOI-GAN (masks). The last row corresponds to the overall proposed model. [F: frame discriminator; G: gradient discriminator; V: video discriminator; R: relational discriminator]

4.2 Evaluation Setup

Evaluation of image/video quality is inherently challenging; thus, we use both quantitative and qualitative metrics.

Method                            |  I↑   S↓   D↑  |  I↑   S↓   D↑  |  I↑   S↓   D↑  |  I↑   S↓   D↑
C-VGAN [vondrick2016generating]   |  1.1  52.1  0.4 |  1.1  52.1  0.4 |  2.1  45.6  0.8 |  1.9  45.1  0.5
C-TGAN [saito2017temporal]        |  1.6  65.4  0.4 |  2.2  28.1  0.5 |  2.4  36.2  1.1 |  1.7  42.8  0.6
MoCoGAN [tulyakov2018mocogan]     |  2.6  25.4  1.0 |  2.0  34.9  1.0 |  2.9  22.8  1.3 |  2.4  27.4  1.5
HOI-GAN (bboxes)                  |  3.8  18.5  2.1 |  3.2  24.1  2.4 |  4.9  26.2  2.7 |  4.0  25.2  2.4
HOI-GAN (masks)                   |  4.3  16.5  2.5 |  3.9  20.2  1.6 |  5.8  15.8  3.0 |  4.5  23.7  2.8
Table 3: Quantitative Evaluation (Effect of Word Embeddings). Comparison of HOI-GAN with C-VGAN, C-TGAN, and MoCoGAN baselines using one-hot encoded labels instead of embeddings (the default version) as conditional inputs (see Section 4.3). Arrows indicate whether lower (↓) or higher (↑) is better. [I: inception score; S: saliency score; D: diversity score]
Ours / Baseline     |     GS1     |     GS2
HOI-GAN / MoCoGAN   | 71.7 / 28.3 | 69.2 / 30.8
HOI-GAN / C-TGAN    | 75.4 / 34.9 | 79.3 / 30.7
HOI-GAN / C-VGAN    | 83.6 / 16.4 | 80.4 / 19.6
Table 4: Human Evaluation. Human preference score (%) for generation scenarios GS1 and GS2. All results have a p-value less than 0.05, implying that they are statistically significant.

Quantitative Metrics. Inception Score (I-score) [salimans2016improved] is a widely used metric for evaluating image generation models. For generated samples x with labels y, I-score is defined as exp(E_x[KL(p(y|x) ‖ p(y))]), where p(y|x) is the conditional label distribution of an ImageNet [ILSVRC15]-pretrained Inception model [szegedy2016rethinking]. We adapt this metric for video quality evaluation: we fine-tune a Kinetics [carreira2017quo]-pretrained video classifier, ResNeXt-101 [xie2017aggregated], for each of our datasets and use it to calculate I-score. Moreover, we believe that measuring realism is more relevant for our task, since the generation process can be conditioned on arbitrary context frames to obtain diverse samples. Therefore, in addition to I-score, we also analyze the first and second terms of the KL divergence separately. We refer to these terms as: (1) Saliency score or S-score (smaller is better), which specifically measures realism, and (2) Diversity score or D-score (higher is better), which indicates the diversity of generated samples.
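A minimal sketch of how these scores can be computed from classifier softmax outputs. The function name is ours, and the identification of S-score with the conditional-entropy term and D-score with the marginal-entropy term of the KL decomposition is our illustrative reading; the exact formulation in the evaluation code may differ in constants.

```python
import numpy as np

def inception_scores(probs, eps=1e-12):
    """Compute I-, S-, and D-scores from classifier softmax outputs.

    probs: (N, C) array; row i is p(y | x_i) from a pretrained video
    classifier evaluated on N generated videos.
    """
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0)  # empirical p(y)
    # Mean per-sample KL(p(y|x) || p(y)) over the N samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    i_score = float(np.exp(kl.mean()))
    # S-score: mean conditional entropy H(p(y|x)); lower = sharper, more realistic
    s_score = float(-np.sum(probs * np.log(probs + eps), axis=1).mean())
    # D-score: entropy of the marginal H(p(y)); higher = more diverse samples
    d_score = float(-np.sum(marginal * np.log(marginal + eps)))
    return i_score, s_score, d_score
```

Under this reading, log(I-score) = D-score − S-score, so the three metrics are mutually consistent.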

Human Preference Score. We conduct a user study for evaluating the quality of generated videos. In each test, we present the participants with two videos generated by two different algorithms and ask which among the two better depicts the given activity, i.e., action-object composition (e.g. lift fork). We evaluate the performance of an algorithm as the overall percentage of tests in which that algorithm’s outputs are preferred. This is an aggregate measure over all the test instances across all participants.
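Aggregating the study results is straightforward; a minimal sketch (the function name is ours):

```python
from collections import Counter

def preference_scores(votes):
    """Aggregate pairwise preference votes into percentages.

    votes: list of winning algorithm names, one entry per
    (test instance, participant) judgment.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}
```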

Baselines. We compare HOI-GAN with three state-of-the-art video generation approaches: (1) VGAN [vondrick2016generating], (2) TGAN [saito2017temporal], and (3) MoCoGAN [tulyakov2018mocogan]. We develop conditional variants of VGAN and TGAN from the descriptions provided in their papers, referred to as C-VGAN and C-TGAN respectively. We observed that these two models saturated easily in the initial iterations; thus, we added dropout in the last layer of the discriminator network in both models. MoCoGAN, which focuses on disentangling motion and content in the latent space, is the closest baseline; we use the code provided by the authors.

4.3 Results

In this section, we discuss the results of our extensive qualitative and quantitative evaluation of HOI-GAN.

Comparison with Baselines. We compare our framework with the baselines C-VGAN, C-TGAN, and MoCoGAN. As shown in Table 1, HOI-GAN with different conditional inputs outperforms C-VGAN and C-TGAN by a significant margin in both generation scenarios. In addition, our overall model shows considerable improvement over MoCoGAN, while MoCoGAN has scores comparable only to some ablated versions of our model (specifically those missing the gradient and/or relational discriminator). Furthermore, we varied the richness of the masks in the conditional context frame, ranging from bounding boxes to segmentation masks for non-person classes obtained using the Mask R-CNN framework [he2017mask]. Segmentation masks provide explicit shape information, whereas with bounding boxes the shape information must be learned by the model. We observe that providing masks during training leads to slight improvements in both scenarios compared to using bounding boxes (refer to Table 1). We also show samples generated using the best version of HOI-GAN for the two generation scenarios (Figure 4). See the supplementary material for more generated samples and additional experiments.

Ablation Study. To illustrate the impact of each discriminator on generating HOI videos, we conduct ablation experiments (refer to Table 2). We observe that adding temporal information through the gradient discriminator and spatio-temporal information through the video discriminator improves generation quality. In particular, adding our scene-graph-based relational discriminator yields a significant improvement, resulting in more realistic videos (refer to the second block in Table 2).

Effect of Word Embeddings. In our approach, we use word embeddings for the action and object labels to share information among semantically similar categories during training. To demonstrate the impact of embeddings, we also trained HOI-GAN using one-hot encoded labels for both actions and objects. These models perform worse than the models trained with semantic embeddings (refer to the last two rows of Table 1 and to Table 3). Nevertheless, they still perform significantly better than the baselines (refer to Table 3).

Human Evaluation. We recruited 15 sequestered participants for our user study. We randomly chose 50 unique categories, taking generated videos for half of them from generation scenario GS1 and the other half from GS2. For each category, we provided three instances, each containing a pair of videos: one generated by a baseline model and the other by HOI-GAN. For each instance, at least 3 participants (ensuring inter-rater reliability) were asked to choose the video that best depicts the given category. The (aggregate) human preference scores for our model versus the baselines range between 69% and 84% for both generation scenarios (refer to Table 4). These results indicate that HOI-GAN generates more realistic videos than the baselines.

Failure Cases.

[Figure: context frame and generated output for "open microwave" and "cut peach"]
Figure 5: Failure Cases. We show 4 frames of the video generated for each given action-object composition, along with the context frame (middle). Best viewed in color on desktop.

We discuss the limitations of our framework using the qualitative examples shown in Figure 5. For "open microwave", although HOI-GAN generates conventional colors for the microwave, it shows limited capability to hallucinate such large objects. For "cut peach" (Figure 5), the generated sample shows that our model can learn the increase in the count of partial objects corresponding to the action cut, as well as the yellow-green color of peach. However, since the model has never observed the interior of a peach during training (cut peach was not in the training set), it cannot create realistic transformations of the peach's state that clearly show the interior. We believe that using external knowledge and semi-supervised data can lead to more powerful generative models while still adhering to the zero-shot compositional setting.

5 Conclusion

In this paper, we introduced the task of zero-shot HOI video generation, i.e., generating human-object interaction (HOI) videos corresponding to unseen action-object compositions, having seen the target action and target object independently. Towards this goal, we proposed the HOI-GAN framework that uses a novel multi-adversarial learning scheme and demonstrated its effectiveness on challenging HOI datasets. Future work in video generation can benefit from our idea of using relational adversaries based on scene graphs to synthesize more realistic videos.


6 Supplementary

This section contains the supplementary information supporting the content in the main paper.

  • Video showing samples generated using our proposed HOI-GAN to supplement Section 4.3.

  • Qualitative evaluation of baselines: videos generated using baselines to supplement Section 4.3.

  • Additional evaluation of our model.

  • Details of preprocessing and data splits for each dataset mentioned in Section 4.1.

  • Details of network architecture used for implementation of our model described in Section 3.

6.1 Additional Samples from HOI-GAN

Please open the video file 5262_video.mp4 in a suitable video player to see the samples together.

6.2 Qualitative Evaluation of Baselines

In this section, we provide videos generated using the baselines C-VGAN, C-TGAN, and MoCoGAN (see Figure 6); please open the document in Adobe Acrobat Reader to view them. The videos are generated with the same composition of context frame, action, and object as conditional inputs.

6.3 Additional Quantitative Evaluation

In addition to generating videos in the zero-shot compositional setting, we also perform experiments in a semi-supervised setting. Here, the model is trained on the full dataset, but samples from the categories (action-object compositions) in the test split receive zeros as conditional variables instead of the corresponding embeddings during training. Thus, the model observes certain transitions of objects and actions in the dataset but has no explicit mapping from semantic information to the visual content and transitions. We show that with the increased number of training samples available in this manner, our model indeed produces more realistic and diverse samples, i.e., achieves a lower S-score and a higher D-score than in the default zero-shot split configuration (refer to the last two rows of Table 1 and to Table 5). The results also show that our model outperforms the baselines in this setting.
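The conditioning scheme for this setting can be sketched as follows. This is an illustrative helper, not the authors' code: the function name, the embedding dimensionality, and the concatenation of action and object embeddings are our assumptions.

```python
import numpy as np

def conditioning_vectors(samples, embed, test_compositions, dim=300):
    """Build conditional inputs, zeroing embeddings for held-out compositions.

    samples: list of (action, object) label pairs for training videos.
    embed: dict mapping a label to its word-embedding vector (length `dim`).
    test_compositions: set of (action, object) pairs in the test split.
    Pairs from the test split get zero vectors, so the model sees their
    videos during semi-supervised training but not their semantics.
    """
    out = []
    for action, obj in samples:
        if (action, obj) in test_compositions:
            out.append(np.zeros(2 * dim))  # no semantic information leaked
        else:
            out.append(np.concatenate([embed[action], embed[obj]]))
    return np.stack(out)
```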

[Figure 6 shows GIF animations for four compositions: "lift fork", "bend carrot", "put spoon", and "open lid", each generated by C-VGAN, C-TGAN, MoCoGAN, and HOI-GAN.]
Figure 6: Qualitative Evaluation of Baselines. Samples generated using the baselines C-VGAN, C-TGAN, and MoCoGAN, alongside samples generated using HOI-GAN for the same composition of conditional inputs. Please open the document in Adobe Acrobat Reader to view these GIF animations.
Method                            |  I ↑   S ↓   D ↑  |  I ↑   S ↓   D ↑  |  I ↑   S ↓   D ↑  |  I ↑   S ↓   D ↑
C-VGAN [vondrick2016generating]   |  2.0  39.5   0.5  |  2.0  38.4   0.4  |  3.1  39.6   0.8  |  2.6  32.7   0.6
C-TGAN [saito2017temporal]        |  2.6  28.4   0.7  |  2.2  28.1   0.5  |  4.1  26.2   1.1  |  2.5  26.2   0.9
MoCoGAN [tulyakov2018mocogan]     |  3.6  16.9   1.3  |  2.6  27.1   1.5  |  5.7  14.5   2.2  |  3.5  23.7   1.8
HOI-GAN (bboxes)                  |  7.1  10.5   4.1  |  6.2  15.7   4.8  |  8.3  10.2   5.1  |  8.7  13.5   6.1
HOI-GAN (masks)                   |  9.5   8.5   5.4  |  8.1   9.6   4.6  | 10.4   8.3   6.5  | 11.2   8.6   6.1
Table 5: Quantitative Evaluation (Unlabeled Data). Comparison of HOI-GAN with the C-VGAN, C-TGAN, and MoCoGAN baselines using unlabeled training data (see section 4.3). Arrows indicate whether lower (↓) or higher (↑) is better. [I: inception score; S: saliency score; D: diversity score]

6.4 Data Splits

As described in Section 4.1, we perform new splits of the dataset for the task of zero-shot HOI video generation. In this section, we provide the details of preprocessing and zero-shot compositional splits for datasets EPIC-Kitchens (EPIC) and 20BN-Something-Something V2 (SS).

EPIC: Processing and Splits. The EPIC-Kitchens dataset originally consists of 39,594 video samples of the form (v, a, o), i.e., a video v with action label a and object label o, spanning 125 unique actions and 352 unique objects. We further filtered the dataset to ensure that the video samples contain both ground-truth bounding box annotations and Mask R-CNN outputs (NMS threshold = 0.7) in the frames uniformly sampled from a video; we interpolated the sequence if the number of such frames was less than 16. We then split the filtered dataset by action-object compositions to obtain train and test splits suitable for the zero-shot compositional setting, i.e., all unique object and action labels in the combined dataset appear independently in the train split, but any action-object pair present in the test split is absent from the train split and vice versa. This yields two splits: (1) a train split containing 19,895 videos that overall depict 1,128 unique action-object compositions, and (2) a test split containing 7,805 videos (568 unique action-object compositions). The final splits consist of compositions spanning 204 unique actions and 63 unique objects.
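The split construction can be sketched as a greedy pass over shuffled compositions. This is a simplified illustration of the constraint, not the authors' exact procedure:

```python
import random

def zero_shot_split(samples, test_frac=0.2, seed=0):
    """Greedy zero-shot compositional split (illustrative).

    samples: list of (video_id, action, object) tuples.
    Guarantees that every action and object in the test split also appears
    in the train split, while no action-object pair occurs in both.
    """
    rng = random.Random(seed)
    comps = sorted({(a, o) for _, a, o in samples})
    rng.shuffle(comps)
    n_test = int(test_frac * len(comps))
    test_comps, train_actions, train_objects = set(), set(), set()
    for a, o in comps:
        # A composition may go to test only if both of its labels are
        # already covered by some train composition.
        if len(test_comps) < n_test and a in train_actions and o in train_objects:
            test_comps.add((a, o))
        else:
            train_actions.add(a)
            train_objects.add(o)
    train = [s for s in samples if (s[1], s[2]) not in test_comps]
    test = [s for s in samples if (s[1], s[2]) in test_comps]
    return train, test
```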

SS: Processing and Splits. The 20BN-Something-Something V2 dataset originally consists of 220,847 video samples of the form (v, l), i.e., a video v with a label l. To transform the dataset instances to the form (v, a, o), we applied the NLTK POS-tagger on l and obtained a verb and a noun. In particular, we took the verb tag (after stemming) in l as the action label a. We observed that all instances of l begin with the present-continuous form of the verb, which acts upon the subsequent noun; therefore, we used the noun that appears immediately after the verb as the object o. We merged the train and validation splits of the transformed dataset. We further filtered the dataset to ensure that the video samples contain objects that can be detected using Mask R-CNN (NMS threshold = 0.7) in the frames uniformly sampled from a video. We then split the transformed dataset by compositions of action and object to obtain train and test splits suitable for the zero-shot compositional setting (as for EPIC). This yields two splits: (1) a train split containing 23,511 videos that overall depict 671 unique action-object compositions, and (2) a test split containing 3,515 videos (135 unique action-object compositions). The final splits consist of compositions spanning 48 unique actions and 62 unique objects.
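A crude stand-in for the NLTK-based extraction described above, exploiting the fact that labels begin with a present-continuous verb acting on the following word; the naive "ing"-stripping here replaces the proper POS tagging and stemming used in our pipeline:

```python
def label_to_action_object(label):
    """Extract an (action, object) pair from a Something-Something label.

    Illustrative only: assumes labels like 'Lifting fork', where the first
    word is a present-continuous verb and the next word is the object.
    """
    words = label.lower().split()
    verb = words[0]
    # Naive stemming: strip the '-ing' suffix (a real pipeline would use
    # an NLTK stemmer, which also handles doubled consonants etc.)
    action = verb[:-3] if verb.endswith('ing') else verb
    obj = words[1] if len(words) > 1 else None
    return action, obj
```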

6.5 Architecture Details

As described in Section 3, our model comprises five networks: a generator and four discriminators. Figure 7 details the architectures used in our implementation of the generator, video discriminator, frame discriminator, and relational discriminator. The gradient discriminator uses the same architecture as the frame discriminator.

(i) Generator Network in HOI-GAN
(ii) Video Discriminator Network in HOI-GAN
(iii) Frame Discriminator Network in HOI-GAN
(iv) Relational Discriminator Network in HOI-GAN
Figure 7: Architecture Details. Model architectures used in our experiments: (i) Generator, (ii) Video Discriminator, (iii) Frame Discriminator (the gradient discriminator has a similar architecture), (iv) Relational Discriminator. Best viewed in color on desktop.