Non-Adversarial Video Synthesis with Learned Priors

03/21/2020, by Abhishek Aich, et al.

Most existing works in video synthesis focus on generating videos using adversarial learning. Despite their success, these methods often require an input reference frame or fail to generate diverse videos from the given data distribution, with little to no uniformity in the quality of the videos that can be generated. Different from these methods, we focus on the problem of generating videos from latent noise vectors, without any reference input frames. To this end, we develop a novel approach that jointly optimizes the input latent space, the weights of a recurrent neural network, and a generator through non-adversarial learning. Optimizing for the input latent space along with the network weights allows us to generate videos in a controlled environment, i.e., we can faithfully generate all videos the model has seen during the learning process as well as new unseen videos. Extensive experiments on three challenging and diverse datasets demonstrate that our approach generates videos of superior quality compared to existing state-of-the-art methods.

1 Introduction

Video synthesis is an open and challenging problem in computer vision. As the literature suggests, a deeper understanding of the spatio-temporal behavior of video frame sequences can directly provide insights for choosing priors, future prediction, and feature learning [34, 36]. Much progress has been made in developing a variety of ways to generate videos, which can be broadly classified into two categories: video generation methods that require only random latent vectors without any reference input pixels [34, 26, 33], and video generation methods that depend on reference input pixels [36, 10, 30]. Current literature contains methods mostly from the second class, which often require some human intervention [36, 10].

Figure 1: Comparison of the proposed non-adversarial approach with a representative adversarial approach (MoCoGAN [33]) on the Chair-CAD [2] dataset. Top: MoCoGAN often generates blurry frames, including similar types of chairs for different videos as the time step increases. Bottom: Our approach, on the other hand, generates relatively sharper frames, maintaining consistency with the type of chair unique to each video in the dataset.

In general, Generative Adversarial Networks (GANs) [8] have shown remarkable success in various kinds of video modality problems [17, 14, 19, 26]. Initially, video generation frameworks predominantly used GANs to synthesize videos from latent noise vectors. For example, VGAN [34] and TGAN [26] proposed generative models that synthesize videos from random latent vectors with deep convolutional GANs. Recently, MoCoGAN [33] proposed to decompose a video into content and motion parts using a generator guided by two discriminators. During testing, these frameworks generate videos that lie in the range of the trained generator by taking random latent vectors as input. While all these methods have obtained reasonable performance on commonly used benchmark datasets, they rely on adversarial learning to train their models and hence inherit the shortcomings of GANs. Specifically, GANs are often very sensitive to multiple factors such as random network initialization and the type of layers employed to build the network [15, 27]. Some infamous drawbacks of GANs are mode collapse (i.e., being able to generate only some modes of the data distribution; see Fig. 1 for an example) and vanishing generator gradients caused by the discriminator becoming far better at distinguishing fake samples from real ones [1].

Figure 2: Overview of the proposed method. Videos can be broken down into two main parts: static and transient components. To capture this, we map a video (a sequence of L frames) into two learnable latent spaces. We jointly learn the static latent space and the transient latent space along with the network weights. We then use these learned latent spaces to generate videos at inference time. See Sec. 3 for more details.

Non-adversarial approaches [4, 16, 11] have recently been explored to tackle these challenges. For example, Generative Latent Optimization (GLO) [4] and Generative Latent Nearest Neighbors (GLANN) [11] investigate the importance of the inductive bias of convolutional networks by removing the discriminator from the GAN training protocol, yielding a non-adversarial learning procedure. These works show that, without a discriminator, a generator can be learned that maps the training images of the given data distribution to a lower dimensional latent space, learned in conjunction with the weights of the generative network. Such a procedure not only avoids the mode-collapse problem of generators, but also provides the user with an optimized low dimensional latent representation (embedding) of the data, in contrast to the random latent space of GANs. Recently, Video-VAE [10] proposed to use a Variational Auto-Encoder (VAE) for conditional video synthesis, either by randomly generating the first frame or by providing it to the model for synthesizing a video. However, the quality of videos generated by Video-VAE often depends on the provided input frame. Non-adversarial video synthesis without any visual input remains a rarely addressed problem.

In this paper, we propose a novel non-adversarial framework to generate videos in a controllable manner without any reference frame. Specifically, we propose to synthesize videos from two optimized latent spaces, one providing control over the static portion of the video (the static latent space) and the other over the transient portion of the video (the transient latent space). We propose to jointly optimize these two spaces while optimizing the network (a generative and a recurrent network) weights with the help of a regression-based reconstruction loss and a triplet loss.

Our approach works as follows. During training, we jointly optimize over the network weights and the latent spaces (both static and transient) and obtain a common transient latent space and an individual static latent space dictionary for all videos sharing the same class (see Fig. 2). During testing, we randomly choose a static vector from the dictionary, concatenate it with the transient latent vector, and generate a video. This gives us a controlled environment for diverse video generation from learned latent vectors for each video in the given dataset, while maintaining almost uniform quality. In addition, the proposed approach also allows a concise representation of video data in the form of learned vectors, frame interpolation (using a low-rank constraint introduced in [12]), and generation of videos unseen during training.

The key contributions of our work are as follows.


  • We propose a novel framework for generating a wide range of diverse videos of almost uniform visual quality from learned latent vectors, without any conditional input reference frame. Our framework obtains a latent space dictionary for both the static and transient portions of the training videos, which enables us to generate even unseen videos of almost equal quality by providing combinations of static and transient latent vectors that were not part of the training data.

  • Our extensive experiments on multiple datasets demonstrate that the proposed method, without the adversarial training protocol, performs better than or on par with current state-of-the-art methods [34, 26, 33]. Moreover, we do not need to optimize the (multiple) discriminator networks used in previous methods [34, 26, 33], which offers a computational advantage.

2 Related Works

Our work relates to two major research directions: video synthesis and non-adversarial learning. In this section, we focus on some representative methods closely related to our work.

2.1 Video Synthesis

Video synthesis has been studied from multiple perspectives [34, 26, 33, 10, 30] (see Tab. 1 for a categorization of existing methods). VGAN [34] demonstrates that a video can be divided into foreground and background using deep neural networks. TGAN [26] proposes to use a temporal generator to capture temporal dynamics by generating correlated latent codes for each video frame, and an image generator to map each of these latent codes to a single frame of the video. MoCoGAN [33] presents a simple approach to separate the content and motion latent codes of a video using adversarial learning. The most relevant work to ours is Video-VAE [10], which extends the idea of image generation to video generation by proposing a structured latent space in conjunction with a VAE architecture for video synthesis. While this method does not require a discriminator network, it depends on a reference input frame to generate a video. In contrast, our method proposes an efficient framework for synthesizing videos from learnable latent vectors without any input frame. This gives a controlled environment for video synthesis that even enables us to generate unseen videos of good visual quality by combining static and transient parts.

Methods | Adversarial learning? | Input frame? | Input latent vectors?
VGAN [34] | ✓ | | ✓ (random)
TGAN [26] | ✓ | | ✓ (random)
MoCoGAN [33] | ✓ | | ✓ (random)
Video-VAE [10] | | ✓ | ✓ (random)
Ours | | | ✓ (learned)
Table 1: Categorization of prior works in video synthesis. Different from existing methods, our model does not require a discriminator or any reference input frame. Moreover, since we use learned latent vectors, we have control over the kind of videos the model generates.

2.2 Non-Adversarial Learning

Generative adversarial networks, as powerful as they are for synthesis in pixel space, are also difficult to train, owing to the saddle-point optimization game between the generator and the discriminator. On top of the challenges discussed in the previous section, GANs require careful user-driven configuration tuning, which may not guarantee the same performance for every run. Techniques to make the generator less sensitive to these problems have been discussed in [27]. An alternative direction has given rise to non-adversarial learning of generative networks [4, 11]. Both [4] and [11] showed that the properties of convolutional GANs can be mimicked using simple reconstruction losses while discarding the discriminator.

While there has been some work on image generation from learned latent vectors [4, 11], our work differs significantly from these methods, as we do not map all the frames of a given video pixel-wise to the same latent distribution. Doing so would require a separate latent space (and hence a separate model) for every video in a given dataset, and performing any operation in that space would naturally become video specific. Instead, we divide the latent space of videos sharing the same class into two parts: static and transient. This gives us a dictionary of static latent vectors for all videos and a common transient latent subspace. Hence, any video of the dataset can now be represented by the combination of one static vector (which remains the same for all frames) and the common transient subspace.

3 Formulation

Define a video clip of L frames as X = [x_1, x_2, ..., x_L]. Corresponding to each frame x_i, let there be a point z_i in a latent space Z such that

    Z_X = {z_1, z_2, ..., z_L},    (1)

which forms a path of length L. We propose to disentangle a video into two parts: a static constituent, which captures the constant portion of the video common to all frames, and a transient constituent, which represents the temporal dynamics between all the frames in the video. Hence, let Z be decomposed as Z = Z_S × Z_T, where Z_S represents the static subspace and Z_T represents the transient subspace, with dimensions d_s and d_t respectively. Thus z_i in (1) can be expressed as z_i = [z_s ; z_t^(i)], with z_s ∈ Z_S and z_t^(i) ∈ Z_T. Next, assuming that the video is of short length, we can fix z_s for all frames after sampling it only once. Therefore, (1) can be expressed as

    Z_X = { [z_s ; z_t^(1)], [z_s ; z_t^(2)], ..., [z_s ; z_t^(L)] }.    (2)

The transient portion represents the motion of a given video. Intuitively, the latent vectors corresponding to this transient state should be correlated; in other words, z_t^(1), ..., z_t^(L) should form a path between z_t^(1) and z_t^(L). Specifically, the frames in a video are correlated in time, and hence a frame x_i at time i is a function of all previous frames x_1, ..., x_(i-1). As a result, their corresponding transient representations should also exhibit such a trajectory. This kind of representation of latent vectors can be obtained by employing a Recurrent Neural Network (RNN), where the output of each cell of the network is a function of its previous state and input. Denote the RNN as R with weights θ_R. Then, the RNN output R_θR(z_t^(1), ..., z_t^(L)) is a sequence of correlated variables representing the transient state of the video.
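For illustration, the following PyTorch sketch shows one way to realize this decomposition: a one-layer GRU (matching the 500-unit RNN reported in Sec. 4.2) maps the per-frame transient codes to a correlated sequence, which is then concatenated with a shared static code. The module and function names are ours and are not taken from the released implementation.

```python
import torch
import torch.nn as nn

class TransientRNN(nn.Module):
    """One-layer GRU that maps L transient codes to L correlated outputs."""
    def __init__(self, d_t, hidden=500):
        super().__init__()
        self.gru = nn.GRU(d_t, hidden, num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden, d_t)   # project back to the transient dimension

    def forward(self, z_t):                  # z_t: (B, L, d_t) learnable codes
        h, _ = self.gru(z_t)
        return self.proj(h)                  # correlated sequence, (B, L, d_t)

def compose_latents(z_s, z_t_corr):
    """Concatenate one static code per video with each correlated transient code."""
    B, L, _ = z_t_corr.shape
    z_s_rep = z_s.unsqueeze(1).expand(B, L, -1)      # z_s is fixed across frames
    return torch.cat([z_s_rep, z_t_corr], dim=-1)    # (B, L, d_s + d_t)
```

Each composed vector [z_s ; ẑ_t^(i)] is then mapped by the generator to the corresponding frame.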

3.1 Learning Network Weights

Define a generative network G with weights represented by θ_G. G takes latent vectors sampled from Z as input and predicts up to L frames of the video clip. For a set of N videos, initialize a set of D-dimensional latent vectors, with D = d_s + d_t, to form the pairs {(X^1, Z^1), (X^2, Z^2), ..., (X^N, Z^N)}. More specifically, from (2), defining z_i^n = [z_s^n ; z_t^(n,i)] and Z^n = {z_1^n, z_2^n, ..., z_L^n} for the n-th video X^n, we will have the pairs {(X^n, Z^n)}, n = 1, ..., N.

With these pairs, we propose to optimize the weights θ_G, θ_R, and the input latent vectors (sampled once at the beginning of training) in the following manner. For each video X^n, we jointly optimize for θ_G, θ_R, and Z^n in every epoch in two stages:

    θ_G, z_s^n  ←  argmin over θ_G, z_s^n of  ℓ( G_θG([z_s^n ; R_θR(z_t^n)]), X^n ),    (3.1)

    θ_R, z_t^n  ←  argmin over θ_R, z_t^n of  ℓ( G_θG([z_s^n ; R_θR(z_t^n)]), X^n ),    (3.2)

where z_t^n = {z_t^(n,1), ..., z_t^(n,L)} collects the transient codes of X^n, [z_s^n ; R_θR(z_t^n)] denotes the frame-wise concatenation of the static vector with the RNN outputs, and ℓ(·, ·) can be any regression-based loss. For the rest of the paper, we will refer to the reconstruction objective shared by (3.1) and (3.2) as L_r.
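A minimal training-loop sketch of the two-stage update is given below. The grouping of the updates (generator weights with static codes, RNN weights with transient codes) follows our reading of (3.1)-(3.2) and Sec. 3.3, and the optimizer choices mirror Appendix B (Adam for network weights, SGD with momentum for latent vectors); the learning rates, `G`, `recon_loss`, and the data loader are placeholders, and `compose_latents` is the helper sketched earlier in this section.

```python
import torch

def train_two_stage(G, R, z_s, z_t, loader, recon_loss, num_epochs=300):
    """Alternating two-stage update. G: frame generator, R: transient RNN,
    z_s: (B, d_s) and z_t: (B, L, d_t) are nn.Parameter tensors for the same
    B training videos that `loader` yields as clips X of shape (B, L, C, H, W)."""
    opt_G  = torch.optim.Adam(G.parameters(), lr=1e-4)        # weights: Adam (Appendix B)
    opt_R  = torch.optim.Adam(R.parameters(), lr=1e-4)
    opt_zs = torch.optim.SGD([z_s], lr=1e-2, momentum=0.9)    # latents: SGD + momentum
    opt_zt = torch.optim.SGD([z_t], lr=1e-2, momentum=0.9)

    def reconstruct():
        z = compose_latents(z_s, R(z_t))                      # (B, L, d_s + d_t)
        return torch.stack([G(z[:, i]) for i in range(z.shape[1])], dim=1)

    for _ in range(num_epochs):
        for X in loader:
            # Stage (3.1): update generator weights and static codes.
            opt_G.zero_grad(); opt_zs.zero_grad()
            recon_loss(reconstruct(), X).backward()
            opt_G.step(); opt_zs.step()

            # Stage (3.2): update RNN weights and transient codes.
            opt_R.zero_grad(); opt_zt.zero_grad()
            recon_loss(reconstruct(), X).backward()
            opt_R.step(); opt_zt.step()
```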

Regularized loss function to capture the static subspace. The transient subspace, along with the RNN, handles the temporal dynamics of the video clip. To equally capture the static portion of the video, we randomly choose a frame from the video and ask the generator to compare its corresponding generated frame with it during training. For this, we update the above loss as follows:

    L_1 = L_r + λ_s L_s,  with  L_s = ℓ( x̂_j, x_j ),    (4)

where j is a frame index chosen uniformly at random from {1, ..., L}, x_j is the ground-truth frame, x̂_j = G_θG([z_s ; R_θR(z_t)])_j is the corresponding generated frame, and λ_s is the regularization constant. L_s can also be understood to essentially play the role of the image discriminator in [33, 35], which ensures that the generated frame is close to the ground-truth frame.
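A small sketch of the static term L_s in (4) is shown below, assuming clips shaped (B, L, C, H, W); sampling the frame index uniformly is our reading of "randomly choose a frame".

```python
import torch

def static_frame_loss(x_hat, x, recon_loss):
    """L_s of Eq. (4): compare one randomly chosen generated frame with its
    ground-truth counterpart. x_hat, x: (B, L, C, H, W)."""
    j = torch.randint(0, x.shape[1], (1,)).item()   # random frame index
    return recon_loss(x_hat[:, j], x[:, j])
```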

3.2 Learning Latent Spaces

Non-adversarial learning involves the joint optimization of network weights as well as of the corresponding input latent space. Apart from the gradients with respect to the loss in (4), we propose to further optimize the latent space with the gradient of a loss based on the triplet condition, as follows.

3.2.1 The Triplet Condition

Figure 3: Triplet Condition in the transient latent space. Latent code representation of different frames of short video clips may lie very near to each other in the transient subspace. Using the proposed triplet condition, our model learns to explain the dynamics of similar looking frames and simultaneously map them to distinct latent vectors.

Short video clips often have indistinguishable dynamics in consecutive frames, which can force the latent code representations to lie very near to each other in the transient subspace. However, an ideal transient space should ensure that the latent vector representation of a frame is closer to that of a similar frame than to that of a dissimilar one [28, 29]. To this end, we add a triplet loss to (4) which ensures that a pair of co-occurring frames x_a (anchor) and x_p (positive) are closer, but distinct, to each other in the embedding space than any other frame x_q (negative) (see Fig. 3). In this work, positive frames are randomly sampled within a margin range of the anchor and negatives are chosen outside of this margin range. Defining a triplet set T of transient latent code vectors {z_t^(a), z_t^(p), z_t^(q)}, we aim to learn the transient embedding space such that

    ||z_t^(a) - z_t^(p)||_2^2 + δ  <  ||z_t^(a) - z_t^(q)||_2^2   for all (z_t^(a), z_t^(p), z_t^(q)) ∈ T,

where T is the set of all possible triplets in the video. With the above regularization, the loss in (4) can be written as

    L_2 = L_1 + λ_t L_tri,  with  L_tri = Σ_T max( 0, ||z_t^(a) - z_t^(p)||_2^2 - ||z_t^(a) - z_t^(q)||_2^2 + δ ),    (5)

where δ is a hyperparameter that controls the margin while selecting positives and negatives, and λ_t is a regularization constant (see Sec. 3.3).
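The sketch below illustrates one possible implementation of the triplet term using PyTorch's built-in triplet margin loss. Sampling a single triplet per call and reusing one value both as the temporal window for selecting positives/negatives and as the embedding margin are simplifications on our part.

```python
import torch
import torch.nn.functional as F

def transient_triplet_loss(z_t, window, margin=1.0):
    """For a random anchor frame, sample a positive inside a temporal window
    and a negative outside it, then apply a triplet margin loss.
    z_t: (L, d_t) learnable transient codes of one video; `window` should be
    small relative to L so that both candidate pools are non-empty."""
    L = z_t.shape[0]
    a = torch.randint(0, L, (1,)).item()
    pos_pool = [i for i in range(L) if i != a and abs(i - a) <= window]
    neg_pool = [i for i in range(L) if abs(i - a) > window]
    p = pos_pool[torch.randint(0, len(pos_pool), (1,)).item()]
    q = neg_pool[torch.randint(0, len(neg_pool), (1,)).item()]
    return F.triplet_margin_loss(z_t[a:a + 1], z_t[p:p + 1], z_t[q:q + 1],
                                 margin=margin)
```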

3.3 Full Objective Function

For any choice of differentiable generator G, the objective in (4) is differentiable with respect to θ_G, θ_R, and the latent vectors [5]. We initialize the static and transient latent vectors by sampling them from two different Gaussian distributions. We also ensure that the latent vectors lie on the unit sphere and hence, following [4], we project every latent vector z after each update by dividing its value by max(||z||_2, 1), where max(·) returns the maximum among the given elements. Finally, the complete objective function can be written as

    min over θ_G, θ_R, {z_s^n, z_t^n}  of  (1/N) Σ_{n=1}^{N} [ L_r^n + λ_s L_s^n + λ_t L_tri^n ],    (6)

where λ_s is the regularization constant for the static loss, λ_t is the regularization constant for the triplet loss, and L_r^n, L_s^n, and L_tri^n denote the losses of (3.1)-(3.2), (4), and (5) computed for the n-th video. The weights θ_G of the generator and the static latent vectors z_s^n are updated by gradients of the losses L_r and L_s. The weights θ_R of the RNN and the transient latent vectors z_t^n are updated by gradients of the losses L_r, L_s, and L_tri.
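The unit-sphere projection described above (following [4]) can be applied in place after every latent update, e.g. right after the latent optimizer steps in the training sketch of Sec. 3.1; the function name below is ours.

```python
import torch

def project_to_unit_ball(z):
    """Keep a latent vector inside the unit sphere by dividing it by
    max(||z||_2, 1), applied after every update (cf. [4])."""
    with torch.no_grad():
        norms = z.norm(p=2, dim=-1, keepdim=True)
        z.div_(torch.clamp(norms, min=1.0))
    return z
```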

3.3.1 Low Rank Representation for Interpolation

The objective of video frame interpolation is to synthesize non-existent frames in between the reference frames. While the triplet condition ensures that similar frames have their transient latent vectors nearby, it does not ensure that they lie on a manifold where simple linear interpolation yields latent vectors that generate frames with plausible motion compared to the preceding and succeeding frames [12, 4]. This means that the transient latent subspace can be represented in a much lower dimensional space compared to its larger ambient space. To enforce such a property, we project the latent vectors onto a low-dimensional space while learning them along with the network weights, as first proposed in [12]. Mathematically, the loss in (6) can be written as

    (6)  subject to  rank(Z_t^n) ≤ r,  where  Z_t^n = [z_t^(n,1), ..., z_t^(n,L)],  n = 1, ..., N,    (7)

where rank(·) indicates the rank of the matrix and r is a hyper-parameter that decides the dimension of the manifold onto which the transient latent vectors are projected. We achieve this by reconstructing the matrix Z_t^n from its top r singular vectors in each iteration [7]. Note that we only employ this condition while optimizing the latent space for the frame interpolation experiments in Sec. 4.3.3.
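A sketch of this rank-r projection: after each update, the matrix of a video's transient codes is replaced by its best rank-r approximation obtained from a truncated SVD.

```python
import torch

def project_rank_r(Z_t, r):
    """Replace the (L, d_t) matrix of transient codes by its best rank-r
    approximation, keeping only the top-r singular vectors (Eq. (7))."""
    with torch.no_grad():
        U, S, Vh = torch.linalg.svd(Z_t, full_matrices=False)
        Z_t.copy_((U[:, :r] * S[:r]) @ Vh[:r])
    return Z_t
```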

4 Experiments

In this section, we present extensive experiments to demonstrate the effectiveness of our proposed approach in generating videos through learned latent spaces.

4.1 Datasets

We evaluate the performance of our approach using three publicly available datasets which have been used in many prior works [10, 34, 33].

Chair-CAD [2]. This dataset consists of a total of 1393 chair CAD models, out of which we randomly choose 820 chairs for our experiments, using the first 16 frames of each video, similar to [10]. The rendered frames of each model are center-cropped and then resized to a fixed resolution. We obtain the transient latent vectors for all the chair models with one static latent vector for the training set.

Weizmann Human Action [9]. This dataset provides 10 different actions performed by 9 people, amounting to 90 videos. Similar to Chair-CAD, we center-crop each frame and then resize it to a fixed resolution. For this dataset, we train our model to obtain nine static latent vectors (for the nine different identities) and ten transient latent vectors (for the ten different actions) for videos with 16 frames each.

Golf scene dataset [34]. The Golf scene dataset [34] contains 20,268 golf videos, comprising 583,508 short video clips in total. We randomly chose 500 videos with 16 frames each and resized the frames to a fixed resolution. As with the Chair-CAD dataset, we obtained the transient latent vectors for all the golf scenes and one static latent vector for the training set.

4.2 Experimental Settings

We implement our framework in PyTorch [22] (code: https://github.com/abhishekaich27/Navsynth/). Please see the supplementary material for implementation details and the values of the different hyper-parameters.

Network Architecture. We choose DCGAN [23] as the generator architecture for the Chair-CAD and Golf scene datasets, and the conditional generator architecture from [21] for the Weizmann Human Action dataset. For the RNN, we employ a one-layer gated recurrent unit network with 500 hidden units [6].

Choice of Loss Function for L_r and L_s. One straightforward choice is the mean squared loss, but it has been shown in the literature that it leads to the generation of blurry pixels [37]. Moreover, it has been shown empirically that generative functions in adversarial learning focus on edges [4]. Motivated by this, the loss function for L_r and L_s is chosen to be the Laplacian pyramid loss [18], defined as

    ℓ(x, x') = Σ_l 2^(2l) || Lap^l(x) - Lap^l(x') ||_1,

where Lap^l(·) is the l-th level of the Laplacian pyramid representation of the input.
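A sketch of such a Laplacian-pyramid ℓ1 loss is given below. Building the pyramid with average pooling and bilinear upsampling, and the 2^(2l) per-level weighting, are our assumptions rather than details taken from the paper or from [18].

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x, levels):
    """Build a `levels`-level Laplacian pyramid for a batch of images (B, C, H, W)."""
    pyr, cur = [], x
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, kernel_size=2)
        up = F.interpolate(down, size=cur.shape[-2:], mode='bilinear',
                           align_corners=False)
        pyr.append(cur - up)          # band-pass residual at this scale
        cur = down
    pyr.append(cur)                   # coarsest (low-pass) level
    return pyr

def lap1_loss(x, y, levels=4):
    """l1 distance between corresponding pyramid levels, with coarser levels
    up-weighted to compensate for their smaller spatial size."""
    px, py = laplacian_pyramid(x, levels), laplacian_pyramid(y, levels)
    return sum((2 ** (2 * l)) * (a - b).abs().mean()
               for l, (a, b) in enumerate(zip(px, py)))
```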

Baselines. We compare our proposed method with two adversarial methods. For Chair-CAD and Weizmann Human Action, we use MoCoGAN [33] as the baseline, and for the Golf scene dataset, we use VGAN [34]. We use the publicly available code for MoCoGAN and VGAN and set the hyper-parameters as recommended in the published work. We also compare two versions of the proposed method obtained by ablating the proposed loss functions. Note that we could not compare our results with Video-VAE [10] using our performance measures (described below), as the implementation has not been made available by the authors and, despite our best efforts, we could not reproduce their reported results.

Performance measures. Past video generation works have been evaluated quantitatively with the Inception Score (IS) [10]. However, it has been shown that IS is not a good evaluation metric for pixel-domain generation, as the maximal IS can be obtained by synthesizing a video from every class or mode in the given data distribution [20, 3, 32]. Moreover, a high IS does not guarantee any confidence in the quality of generation, only in its diversity. Since a generative model trained using our proposed method can generate all videos using the learned latent dictionary, and for a fair comparison with the baselines, we use the following two measures, similar to those provided in [33]. (Direct video comparison is straightforward for our approach because the corresponding one-to-one ground truth is known; for [33, 34], however, we do not know which video is being generated (the action may be known, e.g., in [33]), which makes such a direct comparison infeasible and unfair.) We also provide relevant bounds computed on real videos for reference. Note that the arrows (↓, ↑) indicate whether lower or higher scores are better.

(1) Relative Motion Consistency Score (MCS ↓): The difference between consecutive frames captures the moving components, and hence the motion, in a video. First, each frame of the generated videos, as well as of the ground-truth data, is represented as a feature vector computed at the relu3_3 layer of a VGG16 network [31] pre-trained on ImageNet [25]. Second, the averaged consecutive frame-feature difference vector is computed for both sets of videos, denoted by f_G and f_R respectively. Finally, the relative MCS is given by ||f_G - f_R||_2.

(2) Frame Consistency Score (FCS ↑): This score measures the consistency of the static portion of the generated video frames. We keep the first frame of the generated video as the reference and compute the averaged structural similarity (SSIM) of all frames with respect to it. The FCS is then given by the average of this measure over all videos.
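The following sketch shows how the two scores could be computed. The relu3_3 slice index of torchvision's VGG16, the plain ℓ2 distance for the MCS, and the scikit-image SSIM call (recent API versions) are our assumptions and should be checked against the exact evaluation protocol.

```python
import torch
import torchvision.models as models
from skimage.metrics import structural_similarity as ssim

# Slice VGG16 features up to relu3_3 (index 15 in torchvision's layer stack).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()

@torch.no_grad()
def motion_feature(video):
    """Averaged consecutive frame-feature difference for one video.
    video: (L, 3, H, W), ImageNet-normalized."""
    feats = vgg(video).flatten(1)                  # (L, d)
    return (feats[1:] - feats[:-1]).mean(dim=0)    # (d,)

def mcs(gen_videos, real_videos):
    """Relative Motion Consistency Score (lower is better)."""
    f_g = torch.stack([motion_feature(v) for v in gen_videos]).mean(dim=0)
    f_r = torch.stack([motion_feature(v) for v in real_videos]).mean(dim=0)
    return torch.norm(f_g - f_r, p=2).item()

def fcs(gen_videos_np):
    """Frame Consistency Score (higher is better): mean SSIM of every frame
    against the first frame, averaged over videos. Frames are HxWxC uint8 arrays."""
    scores = []
    for v in gen_videos_np:
        scores += [ssim(v[0], f, channel_axis=-1) for f in v[1:]]
    return float(sum(scores) / len(scores))
```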

4.3 Qualitative Results

Fig. 5 shows some examples with randomly selected frames of generated videos for the proposed method and the adversarial approaches MoCoGAN [33] and VGAN [34]. For the Chair-CAD [2] and Weizmann Human Action [9] datasets, it can be seen that the proposed method is able to generate visually good quality videos with a non-adversarial training protocol, whereas MoCoGAN produces blurry and inconsistent frames. Since we use optimized latent vectors, unlike MoCoGAN (which uses random latent vectors for video generation), our method produces visually more appealing videos. Fig. 5 makes two particularly important points. As visualized for the Chair-CAD videos, the adversarial approach of MoCoGAN produces not only blurred chair images in the generated videos, but also videos of non-uniform quality. Further, it can be seen that as the time step increases, MoCoGAN tends to generate the same chair for different videos. This highlights a major drawback of adversarial approaches: they fail to learn the diversity of the data distribution. Our approach overcomes this by producing an optimized dictionary of latent vectors, which can be used to easily generate any video in the data distribution. To further validate our method, we present the following qualitative experiments.

4.3.1 Qualitative Ablation Study

Fig. 4 qualitatively shows the contribution of specific parts of the proposed method on Chair-CAD [2]. First, we investigate the impact of input latent vector optimization. For a fair comparison, we optimize the model for the same number of epochs. It can be observed that the model benefits from the joint optimization of the input latent space to produce better visual results. Next, we validate the contribution of L_s and L_tri on a difficult video example whose chair color matches the background. Our method, combined with L_s and L_tri, is able to distinguish between the white background and the white body of the chair model.

Figure 4: Qualitative ablation study on Chair-CAD [2]. Left: The model is not able to generate good quality frames when the input latent space is not optimized, resulting in poor videos, whereas with latent optimization the generated frames are sharper. Right: The impact of L_s and L_tri is indicated by the red bounding boxes. Our method with L_s and L_tri captures the difference between the white background and the white chair, whereas without these two loss functions the chair is not distinguishable from its background. + and - indicate presence and absence of the terms, respectively.
(a) Chair-CAD [2]
(b) Weizmann Human Action [9]
(c) Golf [34]
Figure 5: Qualitative comparison with state-of-the-art methods. We show two generated video sequences for MoCoGAN [33] (for (a) Chair-CAD [2] and (b) Weizmann Human Action [9]) and VGAN [34] (for (c) Golf scene [34]) (top), and for the proposed method (Ours, bottom). The proposed method produces visually sharper and more consistent videos using the non-adversarial training protocol. More examples are provided in the supplementary material.
(a) Chair-CAD [2]
Method | MCS (↓) | FCS (↑)
Bound | 0.0 | 0.91
MoCoGAN [33] | 4.11 | 0.85
Ours (L_r only) | 3.83 | 0.77
Ours (full) | 3.32 | 0.89

(b) Weizmann Human Action [9]
Method | MCS (↓) | FCS (↑)
Bound | 0.0 | 0.95
MoCoGAN [33] | 3.41 | 0.85
Ours (L_r only) | 3.87 | 0.79
Ours (full) | 2.63 | 0.90

(c) Golf [34]
Method | MCS (↓) | FCS (↑)
Bound | 0.0 | 0.97
VGAN [34] | 3.61 | 0.88
Ours (L_r only) | 3.78 | 0.84
Ours (full) | 2.71 | 0.84

Table 2: Quantitative comparison with state-of-the-art methods. The proposed method obtains better scores than the adversarial approaches (MoCoGAN and VGAN) on the Chair-CAD [2], Weizmann Human Action [9], and Golf [34] datasets.
Table 3: Generating videos by exchanging actions across identities. Rows correspond to the actions run, walk, jump, and skip, and columns to the identities P1-P9; each cell of the table indicates a video in the dataset. The marked cells indicate videos that were part of the training set; we randomly generated videos corresponding to the remaining (unseen) cells, visualized in Fig. 6.

4.3.2 Action Exchange

Our non-adversarial approach can effectively separate the static and transient portions of a video and generate videos unseen during the training protocol. To validate these points, we set up a simple matrix completion over the combinations of identities and actions in the Weizmann Human Action [9] dataset. For training our model, we created a set of videos (without any cropping, to preserve the full scale of the frames) represented by the marked cells in Tab. 3.

Figure 6: Examples of action exchange to generate unseen videos. This figure shows generated videos that were unseen during the training of the model, with colored bounding boxes indicating the corresponding unseen cells referred to in Tab. 3. This demonstrates the effectiveness of our method in disentangling the static and transient portions of videos.

Hence, the unseen videos correspond to the unmarked cells. During testing, we randomly generated these unseen videos, and the visual results are shown in Fig. 6. This experiment clearly validates our claim of disentangling the static (identity) and transient (action) portions of a video, and of generating unseen videos from combinations of actions and identities that were not part of the training set.

4.3.3 Frame Interpolation

To show that our methodology can be employed for frame interpolation, we trained our model using the loss in (7). During testing, we generated intermediate frames by interpolating the learned latent variables of two distinct frames. For this, we computed the difference between the learned transient latent vectors of the second (z_t^(2)) and fifth (z_t^(5)) frames, and generated the unseen frames from the interpolated transient vectors after concatenating each of them with the static vector z_s. Fig. 7 shows the results of interpolation between the second and fifth frames for two randomly chosen videos. Our method is thus able to produce dynamically consistent frames with respect to the reference frames without any pixel clues.
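A sketch of the interpolation step is shown below; for brevity, the interpolated transient codes are decoded directly together with the static code, without re-running the RNN, which is a simplification of the procedure described above.

```python
import torch

@torch.no_grad()
def interpolate_frames(G, z_s, z_t_a, z_t_b, steps=3):
    """Linearly interpolate between the learned transient codes of two frames
    (e.g. z_t^(2) and z_t^(5)), concatenate each interpolated code with the
    static code z_s, and decode a frame with the generator G."""
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps + 2):
        z_t = (1 - alpha) * z_t_a + alpha * z_t_b
        frames.append(G(torch.cat([z_s, z_t], dim=-1).unsqueeze(0)))
    return torch.stack(frames, dim=1)    # includes the two endpoint frames
```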

Figure 7: Examples of frame interpolation. An important advantage of our method is that interpolation in the learned latent space translates to the video space using (7). It can be observed that as the rank r increases, the interpolation (bounded by color) improves. Note that the adjacent frames are also generated frames, not ground-truth frames.

4.4 Quantitative Results

Quantitative comparisons with the baselines are provided in Tab. 2. Compared to videos generated by the adversarial method MoCoGAN [33], we report a relative decrease of 19.22% in MCS and a 4.70% relative increase in FCS for the Chair-CAD dataset [2]. For the Weizmann Human Action [9] dataset, the proposed method obtains a relative decrease of 22.87% in MCS and a 4.61% relative increase in FCS. Similarly, for the Golf scene dataset [34], we perform competitively with VGAN [34], with an observed relative decrease of 24.90% in MCS. An important conclusion from these results is that our proposed method, being non-adversarial in nature, learns to synthesize a diverse set of videos and is able to perform on par with adversarial approaches. It should be noted that a better loss function for L_r and L_s would produce stronger results; we leave this for future work.

4.4.1 Quantitative Ablation Study

(a) With respect to latent space optimization
Latent vectors | MCS (↓) | FCS (↑)
Bound | 0 | 0.91
Randomly sampled | 3.96 | 0.75
Learned (ours) | 3.32 | 0.89

(b) With respect to the loss functions
L_s | L_tri | MCS (↓) | FCS (↑)
Bound | | 0 | 0.91
- | - | 3.83 | 0.77
 | | 3.82 | 0.85
 | | 3.36 | 0.81
+ | + | 3.32 | 0.89

Table 4: Ablation study of the proposed method on Chair-CAD [2]. In (a), we evaluate the contribution of latent space optimization. In (b), we evaluate the contributions of L_s and L_tri in four combinations. + and - indicate presence and absence of the terms, respectively.

In this section, we demonstrate the contribution of the different components of our proposed methodology on the Chair-CAD [2] dataset. For all the experiments, we randomly generate 500 videos with our model using the learned latent vector dictionary. We divide the ablation study into two parts. First, we present the results for the impact of the learned latent vectors on the network modules. For this, we generate videos once with the learned latent vectors and once with latent vectors randomly sampled from a different distribution. The inter-dependency of our model weights and the learned latent vectors can be interpreted from Tab. 4(a). We see that there is a relative decrease of 16.16% in MCS, from 3.96 to 3.32, and an 18.66% relative increase in FCS. This shows that the optimization of the latent space in the proposed method is important for good quality video generation.

Second, we investigate the impact of the proposed losses on the proposed method. Specifically, we look into the four possible combinations of L_s and L_tri. The results are presented in Tab. 4(b). It can be observed that the triplet loss and the static loss provide the best result when employed together, indicated by the relative decrease of 14.26% in MCS from 3.83 to 3.32.

5 Conclusion

We present a non-adversarial approach for synthesizing videos by jointly optimizing both the network weights and the input latent space. Specifically, our model consists of a global static latent variable for content features, a frame-specific transient latent variable, a deep convolutional generator, and a recurrent neural network, trained using a regression-based reconstruction loss together with a triplet-based loss. Our approach allows us to generate a diverse set of videos of almost uniform quality, perform frame interpolation, and generate videos unseen during training. Experiments on three standard datasets show the efficacy of our proposed approach over state-of-the-art methods.

Acknowledgements. The work was partially supported by NSF grant 1664172 and ONR grant N00014-19-1-2264.

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein Generative Adversarial Networks. In Proceedings of the International Conference on Machine Learning, pp. 214–223.
  • [2] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic (2014) Seeing 3D Chairs: Exemplar Part-based 2D-3D Alignment using a Large Dataset of CAD Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3762–3769.
  • [3] S. Barratt and R. Sharma (2018) A Note on the Inception Score. arXiv preprint arXiv:1801.01973.
  • [4] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam (2018) Optimizing the Latent Space of Generative Networks. In Proceedings of the International Conference on Machine Learning, pp. 599–608.
  • [5] A. Bora, A. Jalal, E. Price, and A. G. Dimakis (2017) Compressed Sensing using Generative Models. In Proceedings of the International Conference on Machine Learning, pp. 537–546.
  • [6] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv preprint arXiv:1412.3555.
  • [7] J. Friedman, T. Hastie, and R. Tibshirani (2001) The Elements of Statistical Learning. Vol. 1, Springer Series in Statistics, New York.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative Adversarial Networks. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [9] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri (2007) Actions as Space-Time Shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (12), pp. 2247–2253.
  • [10] J. He, A. Lehrmann, J. Marino, G. Mori, and L. Sigal (2018) Probabilistic Video Generation using Holistic Attribute Control. In Proceedings of the European Conference on Computer Vision, pp. 452–467.
  • [11] Y. Hoshen, K. Li, and J. Malik (2019) Non-Adversarial Image Synthesis with Generative Latent Nearest Neighbors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5811–5819.
  • [12] R. Hyder and M. S. Asif (2020) Generative Models for Low-Dimensional Video Representation and Reconstruction. IEEE Transactions on Signal Processing 68, pp. 1688–1701.
  • [13] D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
  • [14] Y. Kwon and M. Park (2019) Predicting Future Frames using Retrospective Cycle-GAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1811–1820.
  • [15] J. Li, A. Madry, J. Peebles, and L. Schmidt (2017) Towards Understanding the Dynamics of Generative Adversarial Networks. arXiv preprint arXiv:1706.09884.
  • [16] K. Li and J. Malik (2018) Implicit Maximum Likelihood Estimation. arXiv preprint arXiv:1809.09087.
  • [17] X. Liang, L. Lee, W. Dai, and E. P. Xing (2017) Dual-Motion GAN for Future-Flow Embedded Video Prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1744–1752.
  • [18] H. Ling and K. Okada (2006) Diffusion Distance for Histogram Comparison. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 246–253.
  • [19] W. Lotter, G. Kreiman, and D. Cox (2016) Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. arXiv preprint arXiv:1605.08104.
  • [20] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018) Are GANs Created Equal? A Large-Scale Study. In Advances in Neural Information Processing Systems, pp. 700–709.
  • [21] M. Mirza and S. Osindero (2014) Conditional Generative Adversarial Networks. arXiv preprint arXiv:1411.1784.
  • [22] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic Differentiation in PyTorch. In NIPS AutoDiff Workshop.
  • [23] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434.
  • [24] S. Ruder (2016) An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747.
  • [25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet: Large-Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • [26] M. Saito, E. Matsumoto, and S. Saito (2017) Temporal Generative Adversarial Nets with Singular Value Clipping. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2830–2839.
  • [27] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
  • [28] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
  • [29] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain (2018) Time-Contrastive Networks: Self-Supervised Learning from Video. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1134–1141.
  • [30] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe (2019) Animating Arbitrary Objects via Deep Motion Transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2377–2386.
  • [31] K. Simonyan and A. Zisserman (2014) Very Deep Convolutional Networks for Large-scale Image Recognition. arXiv preprint arXiv:1409.1556.
  • [32] L. Theis, A. van den Oord, and M. Bethge (2016) A Note on the Evaluation of Generative Models. In Proceedings of the International Conference on Learning Representations, pp. 1–10.
  • [33] S. Tulyakov, M. Liu, X. Yang, and J. Kautz (2018) MoCoGAN: Decomposing Motion and Content for Video Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1526–1535.
  • [34] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating Videos with Scene Dynamics. In Advances in Neural Information Processing Systems, pp. 613–621.
  • [35] T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-Video Synthesis. In Advances in Neural Information Processing Systems, pp. 1144–1156.
  • [36] T. Wang, Y. Cheng, C. Hubert Lin, H. Chen, and M. Sun (2019) Point-to-Point Video Generation. In Proceedings of the IEEE International Conference on Computer Vision.
  • [37] H. Zhao, O. Gallo, I. Frosio, and J. Kautz (2015) Loss Functions for Neural Networks for Image Processing. arXiv preprint arXiv:1511.08861.

A Dataset Descriptions

Figure 8: Sample videos from the datasets used in the paper. Two unprocessed video examples from the Chair-CAD [2], Weizmann Human Action [9], and Golf Scene [34] datasets are presented here. As seen from the examples, the datasets are diverse in nature, different in categories, and present unique challenges in learning the transient and static portions of the videos. Best viewed in color.

Chair-CAD [2]. This dataset provides 1393 chair CAD models. Each model's frame sequence is produced using two elevation angles and thirty-one azimuth angles. All the chair models are rendered at a fixed distance from the camera. The authors provide four video sequences per CAD model. We choose the first 16 frames of each video and consider the complete dataset as one class.

Weizmann Human Action [9]. This dataset is a collection of 90 video sequences showing nine different identities performing 10 different actions, namely run, walk, skip, jumping-jack ('jack'), jump-forward-on-two-legs ('jump'), jump-in-place-on-two-legs ('pjump'), gallop-sideways ('side'), wave-two-hands ('wave2'), wave-one-hand ('wave1'), and bend. We randomly choose 16 consecutive frames for every video in each iteration during training.

Golf Scene [34]. [34] released a dataset containing 35 million clips (32 frames each) stabilized with SIFT+RANSAC. It contains several categories filtered by a pre-trained Places-CNN model, one of them being golf scenes. The Golf scene dataset contains 20,268 golf videos. Because many non-golf videos are part of the golf category (due to inaccurate labels), this dataset presents a particularly challenging data distribution for our proposed method. Note that, for a fair comparison, we further selected our training videos from this dataset to correspond to golf action as closely as possible, and trained the VGAN [34] model on the same selected videos.

B Implementation Details

We used PyTorch [22] for our implementation. The Adam optimizer [13] was used to update the model weights, and the SGD optimizer [24] with momentum was used to update the latent spaces. The corresponding learning rates for the generator G, the RNN R, and the latent spaces were set to the values indicated in Tab. 6.

Hyper-parameters. Methods [33, 34, 26] that generate videos from latent priors have no dataset split, as the task is to synthesize high quality videos from the data distribution and then evaluate the model performance. All hyper-parameters (except the latent dimensions d_s and d_t) are set as described in [29, 28, 33, 35] (e.g., the triplet margin δ from [29]). For d_s and d_t, we follow the strategy used in Sec. 4.3 of [33] and observe that our model generates videos with good visual quality (FCS) and plausible motion (MCS) for Chair-CAD when (d_s, d_t) = (206, 50). The same strategy is used for all datasets. The hyper-parameters employed for each dataset used in this paper are given in Tab. 6. G and R refer to the generator, with weights θ_G, and the RNN, with weights θ_R, respectively. η represents the learning rate and E the number of epochs. d_s and d_t refer to the static and transient latent dimensions, λ_s and λ_t to the static loss and triplet loss regularization constants, δ is the margin for the triplet loss, and l refers to the level of the Laplacian pyramid representation used in L_r and L_s.

Datasets Hyper-parameters
Chair-CAD [2] 206 50 0.01 0.01 2 12.5 5 300 4
Weizmann Human Action [9] 56 200 0.01 0.1 2 12.5 5 700 3
Golf Scene [34] 56 200 0.01 0.01 2 0.1 0.1 12.5 10 1000 4
Table 6: Hyper-parameters used in all experiments for all datasets.

Other details. We performed all our experiments on a system with a 48-core Intel(R) Xeon(R) Gold 6126 processor and 256 GB RAM. We used an NVIDIA GeForce RTX 2080 Ti for all GPU computations during training. Further, NVIDIA Tesla K40 GPUs were used for the computation of all evaluation metrics in our experiments. All our implementations are based on non-optimized PyTorch code. Our runtime analysis revealed that it took one to two days on average to train the model and obtain the learned latent vectors.

C More Qualitative Examples

In this section, we provide more qualitative results of videos synthesized using our proposed approach on each dataset (Fig. 9 for the Chair-CAD [2] dataset, Fig. 10 for the Weizmann Human Action [9] dataset, and Fig. 11 for the Golf scene [34] dataset). We also provide more examples of the interpolation experiment in Fig. 12.

Figure 9: Qualitative results on Chair-CAD [2]. On this large scale dataset, our model is able to capture the intrinsic rotation and color of videos unique to each chair model. This shows the efficacy of our approach, compared to adversarial approaches such as MoCoGAN [33] which produce the same chair for all videos, with blurry frames (See Fig. 1 of main manuscript).
Figure 10: Qualitative results on Weizmann Human Action [9]. The videos show that our model produces sharp visual results with the combination of the trained generator and RNN along with the 9 identity and 10 action latent vectors.
Figure 11: Qualitative results on Golf Scene [34]. Our proposed approach produces visually good results on this particularly challenging dataset. Due to incorrect labels on the videos, this dataset has many non-golf videos. Our model is still able to capture the static and transient portion of the videos, although better filtering can still improve our results.
(a) Chair-CAD [2]
(b) Weizmann Human Action [9]
Figure 12: More interpolation results. In this figure, r represents the rank of the transient latent vectors Z_t. We present the interpolation results on (a) the Chair-CAD dataset and (b) the Weizmann Human Action dataset, for different values of r. It can be observed that as r increases, the interpolation becomes clearer.