Navsynth
Most existing works on video synthesis focus on generating videos through adversarial learning. Despite their success, these methods often require an input reference frame, or fail to generate diverse videos from the given data distribution, with little to no uniformity in the quality of the generated videos. Different from these methods, we focus on the problem of generating videos from latent noise vectors, without any reference input frames. To this end, we develop a novel approach that jointly optimizes the input latent space, the weights of a recurrent neural network, and a generator through non-adversarial learning. Optimizing for the input latent space along with the network weights allows us to generate videos in a controlled environment, i.e., we can faithfully generate all videos the model has seen during training as well as new, unseen videos. Extensive experiments on three challenging and diverse datasets demonstrate that our approach generates videos of superior quality compared to existing state-of-the-art methods.
Video synthesis is an open and challenging problem in computer vision. As the literature suggests, a deeper understanding of the spatio-temporal behavior of video frame sequences can directly provide insights for choosing priors, future prediction, and feature learning [34, 36]. Much progress has been made in developing a variety of ways to generate videos, which can be broadly classified into two categories: methods that require only random latent vectors without any reference input pixels [34, 26, 33], and methods that depend on reference input pixels [36, 10, 30]. The current literature contains methods mostly from the second class, which often require some human intervention [36, 10].

In general, Generative Adversarial Networks (GANs) [8] have shown remarkable success in various video modality problems [17, 14, 19, 26]. Initially, video generation frameworks predominantly used GANs to synthesize videos from latent noise vectors. For example, VGAN [34] and TGAN [26] proposed generative models that synthesize videos from random latent vectors with deep convolutional GANs. Recently, MoCoGAN [33] proposed to decompose a video into content and motion parts using a generator guided by two discriminators. During testing, these frameworks generate videos that lie in the range of the trained generator by taking random latent vectors as input. While all these methods have obtained reasonable performance on commonly used benchmark datasets, they rely on adversarial learning to train their models and hence inherit the shortcomings of GANs. Specifically, GANs are often very sensitive to multiple factors such as random network initialization and the type of layers employed to build the network [15, 27]. Some infamous drawbacks of GANs are mode collapse (i.e., generating only some parts of the data distribution; see Fig. 1 for an example) and/or vanishing generator gradients caused by the discriminator becoming far better at distinguishing fake samples from real ones [1].
Non-adversarial approaches [4, 16, 11] have recently been explored to tackle these challenges. For example, Generative Latent Optimization (GLO) [4] and Generative Latent Nearest Neighbor (GLANN) [11] investigate the importance of inductive bias in convolutional networks by removing the discriminator for a non-adversarial learning protocol. These works show that, without a discriminator, a generator can be learned that maps the training images in the given data distribution to a lower-dimensional latent space learned in conjunction with the weights of the generative network. Such a procedure not only avoids the mode-collapse problem of generators, but also provides the user an optimized low-dimensional latent representation (embedding) of the data, in contrast to the random latent space of GANs. Recently, VideoVAE [10] proposed to use a Variational Auto-Encoder (VAE) for conditional video synthesis, either by randomly generating or being provided the first frame for synthesizing a video. However, the quality of videos generated by VideoVAE often depends on the provided input frame. Non-adversarial video synthesis without any visual input remains a novel and rarely addressed problem.
In this paper, we propose a novel non-adversarial framework to generate videos in a controllable manner without any reference frame. Specifically, we propose to synthesize videos from two optimized latent spaces, one providing control over the static portion of the video (static latent space) and the other over the transient portion (transient latent space). We propose to jointly optimize these two spaces along with the network (a generative and a recurrent network) weights using a regression-based reconstruction loss and a triplet loss.
Our approach works as follows. During training, we jointly optimize over the network weights and the latent spaces (both static and transient), obtaining a common transient latent space and an individual static latent space dictionary for all videos sharing the same class (see Fig. 2). During testing, we randomly choose a static vector from the dictionary, concatenate it with a transient latent vector, and generate a video. This gives us a controlled environment for diverse video generation from the learned latent vectors of each video in the given dataset, while maintaining almost uniform quality. In addition, the proposed approach allows a concise representation of the video data in the form of learned vectors, frame interpolation (using a low-rank constraint introduced in [12]), and generation of videos unseen during training.

The key contributions of our work are as follows.
We propose a novel framework for generating a wide range of diverse videos of almost uniform visual quality from learned latent vectors, without any conditional input reference frame. Our framework obtains a latent space dictionary over both the static and transient portions of the training videos, which enables us to generate even unseen videos of almost equal quality by providing combinations of static and transient latent vectors that were not part of the training data.
Our extensive experiments on multiple datasets demonstrate that the proposed method, without any adversarial training protocol, performs better than or on par with current state-of-the-art methods [34, 26, 33]. Moreover, we do not need to optimize the (multiple) discriminator networks used in previous methods [34, 26, 33], which offers a computational advantage.
Our work relates to two major research directions: video synthesis and non-adversarial learning. In this section, we focus on representative methods closely related to our work.
Video synthesis has been studied from multiple perspectives [34, 26, 33, 10, 30] (see Tab. 1 for a categorization of existing methods). VGAN [34] demonstrates that a video can be divided into foreground and background using deep neural networks. TGAN [26] uses a temporal generator to capture temporal dynamics by generating correlated latent codes for each video frame, and an image generator to map each of these latent codes to a single frame of the video. MoCoGAN [33] presents a simple approach to separate the content and motion latent codes of a video using adversarial learning. The most relevant work to ours is VideoVAE [10], which extends image generation to video generation using a VAE by proposing a structured latent space in conjunction with the VAE architecture. While this method does not require a discriminator network, it depends on a reference input frame to generate a video. In contrast, our method provides an efficient framework for synthesizing videos from learnable latent vectors without any input frame. This gives a controlled environment for video synthesis that even enables us to generate visually good-quality unseen videos by combining static and transient parts.
| Methods | Adversarial learning? | Input frame? | Input latent vectors? |
|---|---|---|---|
| VGAN [34] | ✓ | ✗ | ✓ (random) |
| TGAN [26] | ✓ | ✗ | ✓ (random) |
| MoCoGAN [33] | ✓ | ✗ | ✓ (random) |
| VideoVAE [10] | ✗ | ✓ | ✓ (random) |
| Ours | ✗ | ✗ | ✓ (learned) |
Generative adversarial networks, as powerful as they are for synthesis in pixel space, are also difficult to train. This is owing to the saddle-point optimization game between the generator and the discriminator. On top of the challenges discussed in the previous section, GANs require careful user-driven configuration tuning, which may not guarantee the same performance on every run. Some techniques to make the generator agnostic to these problems are discussed in [27]. An alternative line of work has given rise to non-adversarial learning of generative networks [4, 11]. Both [4] and [11] show that the properties of convolutional GANs can be mimicked using simple reconstruction losses while discarding the discriminator.
While there has been some work on image generation from learned latent vectors [4, 11], our work differs significantly from these methods in that we do not map all the frames of a given video pixel-wise to the same latent distribution. Doing so would require a separate latent space (and hence a separate model) for each video in a given dataset, and performing any operation in that space would naturally become video-specific. Instead, we divide the latent space of videos sharing the same class into two parts: static and transient. This gives us a dictionary of static latent vectors for all videos and a common transient latent subspace. Hence, any video in the dataset can be represented by the combination of one static vector (which remains the same for all frames) and the common transient subspace.
Define a video clip with T frames as X = {x^1, x^2, …, x^T}. Corresponding to each frame x^i, let there be a point z^i in a latent space Z such that

    x^i = G(z^i), i = 1, …, T,   (1)

which forms a path of length T in Z (G is the generative network defined below). We propose to disentangle a video into two parts: a static constituent, which captures the constant portion of the video common to all frames, and a transient constituent, which represents the temporal dynamics between all the frames of the video. Hence, let Z be decomposed as Z = Z_s × Z_t, where Z_s ⊂ R^{d_s} represents the static subspace and Z_t ⊂ R^{d_t} represents the transient subspace, with d_s + d_t = d. Thus z^i in (1) can be expressed as z^i = [z_s^i, z_t^i]. Next, assuming that the video is of short length, we can fix z_s^i = z_s for all frames after sampling it only once. Therefore, (1) can be expressed as

    x^i = G([z_s, z_t^i]), i = 1, …, T.   (2)
The transient portion represents the motion of a given video. Intuitively, the latent vectors corresponding to this transient state should be correlated; in other words, {z_t^1, …, z_t^T} will form a path between z_t^1 and z_t^T. Specifically, the frames of a video are correlated in time, so a frame x^i at time i is a function of all previous frames x^1, …, x^{i-1}. As a result, the corresponding transient representations should exhibit the same kind of trajectory. Such a representation of the latent vectors can be obtained by employing a Recurrent Neural Network (RNN), where the output of each cell is a function of its previous state or input. Denote the RNN as R, with weights θ_R. Then the RNN output R(z_t) = {z_t^1, …, z_t^T} is a sequence of correlated variables representing the transient state of the video.
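As an illustrative sketch (not the authors' exact architecture or weights), a single gated recurrent unit cell can unroll one transient vector into a sequence of correlated codes; the weight matrices below are caller-supplied placeholders, and the gating follows the standard GRU equations:

```python
import numpy as np

def gru_unroll(z_t, T, Wz, Wr, Wh):
    """Unroll one transient latent vector into T correlated codes.

    Each output depends on the previous hidden state, so the sequence
    traces a smooth path in the transient subspace, as argued above.
    Weight matrices have shape (d, 2d); biases are omitted for brevity.
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    d = z_t.shape[0]
    h = np.zeros(d)
    codes = []
    for _ in range(T):
        xh = np.concatenate([z_t, h])
        u = sigmoid(Wz @ xh)              # update gate
        r = sigmoid(Wr @ xh)              # reset gate
        h_cand = np.tanh(Wh @ np.concatenate([z_t, r * h]))
        h = (1.0 - u) * h + u * h_cand    # blend old state and candidate
        codes.append(h.copy())
    return np.stack(codes)                # shape (T, d)
```

Because each code is a convex blend of the previous state and a bounded candidate, consecutive codes stay close, which is exactly the correlated-trajectory property the text asks of the transient subspace.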
Define a generative network G with weights θ_G. G takes latent vectors sampled from Z as input and predicts up to T frames of the video clip. For a set of N videos, initialize a set of N d-dimensional vectors to form the pairs {(z_1, X_1), (z_2, X_2), …, (z_N, X_N)}. More specifically, from (2), defining z_j = [z_{s,j}, z_{t,j}] and unrolling the transient part with the RNN, we will have the pairs {([z_{s,j}, R(z_{t,j})], X_j)}, j = 1, …, N.
With these pairs, we propose to optimize the weights θ_G, θ_R, and the input latent vectors (sampled once at the beginning of training) in the following manner. For each video X_j, we jointly optimize for θ_G, θ_R, and z_j in every epoch in two stages:

    min_{θ_G, z_s}  Σ_{i=1}^{T} ℓ( G([z_s, z_t^i]), x^i ),   (3.1)
    min_{θ_R, z_t}  Σ_{i=1}^{T} ℓ( G([z_s, R(z_t)^i]), x^i ),   (3.2)

where ℓ can be any regression-based loss. For the rest of the paper, we refer to (3.1) and (3.2) together as the reconstruction loss L_r.
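The alternating two-stage scheme can be sketched with a toy linear "generator" G(z) = Wz standing in for the convolutional network (an illustrative assumption; the dimensions, learning rate, and data below are arbitrary). Stage 1 updates the generator weights and the static code; stage 2 updates the transient codes, both against the same reconstruction loss:

```python
import numpy as np

rng = np.random.default_rng(0)
T, ds, dt, px = 8, 3, 2, 5                 # frames, static dim, transient dim, pixels
X = rng.normal(size=(T, px))               # "frames" of one toy video
W = rng.normal(size=(px, ds + dt)) * 0.1   # linear stand-in for the generator
z_s = rng.normal(size=ds)                  # static code, shared by all frames
z_t = rng.normal(size=(T, dt))             # per-frame transient codes
lr = 0.05

def recon_loss():
    Z = np.hstack([np.tile(z_s, (T, 1)), z_t])
    R = Z @ W.T - X                        # residuals over all frames
    return (R ** 2).mean(), R, Z

l_start, _, _ = recon_loss()
for _ in range(200):
    # stage 1: update generator weights and the static code
    _, R, Z = recon_loss()
    W -= lr * 2.0 * R.T @ Z / (T * px)
    z_s -= lr * 2.0 * (R @ W[:, :ds]).sum(axis=0) / (T * px)
    # stage 2: update the transient codes
    _, R, _ = recon_loss()
    z_t -= lr * 2.0 * (R @ W[:, ds:]) / (T * px)
l_end, _, _ = recon_loss()
```

The same structure carries over to the real model, with ℓ as the Laplacian pyramid loss, automatic differentiation in place of the hand-written gradients, and the RNN unrolling z_t.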
Regularized loss function to capture static subspace.
The transient subspace, along with the RNN, handles the temporal dynamics of the video clip. To equally capture the static portion of the video, we randomly choose a frame from the video and ask the generator to compare its corresponding generated frame during training. For this, we update the above loss as follows:

    L_r + λ_s L_s,  with  L_s = ℓ( G([z_s, z_t^k]), x^k ),   (4)

where k is sampled uniformly from {1, …, T}, x^k is the ground-truth frame, and λ_s is the regularization constant. L_s can also be understood as essentially playing the role of the image discriminator in [33, 35], which ensures that a generated frame is close to the ground-truth frame.
Non-adversarial learning involves joint optimization of the network weights as well as the corresponding input latent space. Apart from the gradients of the loss in (4), we propose to further optimize the latent space with the gradient of a loss based on the triplet condition, as follows.
Short video clips often have indistinguishable dynamics in consecutive frames, which can force the latent code representations to lie very near each other in the transient subspace. However, an ideal transient space should ensure that the latent vector of a frame is closer to that of a similar frame than to that of a dissimilar one [28, 29]. To this end, we add to (4) a triplet loss that ensures that a pair of co-occurring frames, the anchor and the positive, are close but distinct in the embedding space, while any other frame, the negative, is farther away (see Fig. 3). In this work, positives are randomly sampled within a margin range of the anchor, and negatives are chosen outside this range. Defining a triplet set P of transient latent code vectors {(z_t^a, z_t^p, z_t^n)}, we aim to learn the transient embedding space such that

    ||z_t^a − z_t^p||_2^2 + δ < ||z_t^a − z_t^n||_2^2 for all triplets,

where P is the set of all possible triplets. With the above regularization, the loss in (4) can be written as
    L_r + λ_s L_s + λ_t L_trip,  with  L_trip = Σ_{(a,p,n) ∈ P} max( 0, ||z_t^a − z_t^p||_2^2 − ||z_t^a − z_t^n||_2^2 + δ ),   (5)

where δ is a hyperparameter that controls the margin while selecting positives and negatives.
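The triplet condition above can be written as a standard hinge loss on one triplet of transient codes (a common formulation, as in [28]; the margin value used below is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=2.0):
    """Hinge-style triplet loss on transient latent codes.

    Penalizes the triplet whenever the anchor-positive squared distance
    is not smaller than the anchor-negative squared distance by at least
    `margin`; otherwise the loss is zero.
    """
    d_ap = np.sum((anchor - positive) ** 2)   # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)   # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)
```

Summing this quantity over all triplets in the set gives the regularizer added to the reconstruction objective.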
For any choice of differentiable generator G, the objective (4) is differentiable with respect to θ_G, θ_R, and the latent vectors [5]. We initialize the latent vectors by sampling the static and transient parts from two different Gaussian distributions. We also ensure that the latent vectors lie on the unit sphere; hence, we project each latent vector z after every update by dividing its value by max(1, ||z||_2) [4], where max(·) returns the maximum of the given elements. Finally, the complete objective function can be written as

    min_{θ_G, θ_R, z}  L_r + λ_s L_s + λ_t L_trip,   (6)

where L_r is the reconstruction loss, L_s the static loss, L_trip the triplet loss, and λ_t is the regularization constant for the triplet loss. The weights θ_G of the generator and the static latent vectors are updated by the gradients of L_r and L_s. The weights θ_R of the RNN and the transient latent vectors are updated by the gradients of L_r, L_s, and L_trip.
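The projection step after each latent update can be sketched as follows, mirroring the rule from [4]: divide by max(1, ||z||_2), which leaves vectors already inside the unit ball untouched:

```python
import numpy as np

def project_latent(z):
    """Project a latent vector after a gradient update, as in [4].

    Vectors with norm greater than 1 are scaled back onto the unit
    sphere; vectors with norm at most 1 are returned unchanged.
    """
    return z / max(1.0, np.linalg.norm(z))
```

This keeps the learned latent dictionary in a bounded region, which makes later sampling and interpolation between learned vectors well behaved.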
The objective of video frame interpolation is to synthesize non-existent frames in between reference frames. While the triplet condition ensures that similar frames have nearby transient latent vectors, it does not ensure that they lie on a manifold where simple linear interpolation yields latent vectors that generate frames with plausible motion relative to the preceding and succeeding frames [12, 4]. This means that the transient latent subspace can be represented in a much lower-dimensional space than its larger ambient space. To enforce such a property, we project the latent vectors into a low-dimensional space while learning them along with the network weights, as first proposed in [12]. Mathematically, the loss in (6) can be written as

    min_{θ_G, θ_R, z}  L_r + λ_s L_s + λ_t L_trip,  subject to  rank(Z_t) ≤ r,   (7)

where rank(·) indicates the rank of the matrix of transient latent vectors Z_t, and r is a hyperparameter that decides the manifold onto which Z_t is projected. We achieve this by reconstructing the matrix from its top r singular vectors in each iteration [7]. Note that we only employ this condition when optimizing the latent space for the frame interpolation experiments in Sec. 4.3.3.
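The per-iteration rank-r projection via top singular vectors is a standard truncated SVD, sketched below (the matrix shape is illustrative):

```python
import numpy as np

def low_rank_project(Z, r):
    """Project a matrix of transient latent vectors onto the set of
    rank-r matrices by keeping only its top-r singular triplets."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]
```

By the Eckart-Young theorem this truncation is the best rank-r approximation of Z in the Frobenius norm, so the projection discards as little of the learned structure as possible.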
In this section, we present extensive experiments to demonstrate the effectiveness of our proposed approach in generating videos through learned latent spaces.
We evaluate the performance of our approach using three publicly available datasets which have been used in many prior works [10, 34, 33].
ChairCAD [2]. This dataset consists of a total of 1,393 chair CAD models, out of which we randomly choose 820 chairs for our experiments, using the first 16 frames, similar to [10]. The rendered frames of each video are center-cropped and then resized to a fixed resolution. We obtain transient latent vectors for all the chair models, with one static latent vector for the training set.
Weizmann Human Action [9]. This dataset provides 10 different actions performed by 9 people, amounting to 90 videos. As with ChairCAD, we center-crop each frame and then resize it to a fixed resolution. For this dataset, we train our model to obtain nine static latent vectors (for the nine identities) and ten transient latent vectors (for the ten actions), for videos with 16 frames each.
Golf scene dataset [34]. This dataset contains 20,268 golf videos, comprising 583,508 short video clips in total. We randomly chose 500 videos with 16 frames each and resized the frames to a fixed resolution. As with ChairCAD, we obtained transient latent vectors for all the golf scenes and one static latent vector for the training set.
We implement our framework in PyTorch [22] (code: https://github.com/abhishekaich27/Navsynth/). Please see the supplementary material for implementation details and the values of the different hyperparameters.

Network Architecture. We choose the DCGAN [23] generator architecture for the ChairCAD and Golf scene datasets, and the conditional generator architecture from [21] for the Weizmann Human Action dataset. For the RNN, we employ a one-layer gated recurrent unit network with 500 hidden units [6].
Choice of Loss Function for L_r and L_s. One straightforward choice is the mean squared loss, but it has been shown in the literature to lead to blurry pixels [37]. Moreover, it has been shown empirically that generative functions in adversarial learning focus on edges [4]. Motivated by this, the loss function ℓ for L_r and L_s is chosen to be the Laplacian pyramid loss [18], a weighted sum of L1 differences between the levels L^l(x) of the Laplacian pyramid representations of the generated and ground-truth frames.
Baselines. We compare our proposed method with two adversarial methods. For ChairCAD and Weizmann Human Action, we use MoCoGAN [33] as the baseline, and for the Golf scene dataset, we use VGAN [34]. We use the publicly available code for MoCoGAN and VGAN, and set the hyperparameters as recommended in the published work. We also compare two different versions of the proposed method by ablating the proposed loss functions. Note that we could not compare our results with VideoVAE [10] using our performance measures (described below), as the implementation has not been made available by the authors and, despite our best efforts, we could not reproduce their reported results.
Performance measures. Past video generation works have been evaluated quantitatively with the Inception score (IS) [10]. However, it has been shown that IS is not a good evaluation metric for pixel-domain generation, as the maximal IS can be obtained by synthesizing one video from every class or mode of the given data distribution [20, 3, 32]. Moreover, a high IS guarantees confidence only in the diversity of generation, not in its quality. Since a generative model trained with our method can generate all videos using the learned latent dictionary^{2}^{2}2Direct video comparison is straightforward for our approach, as the corresponding one-to-one ground truth is known. For [33, 34], however, we do not know which video is being generated (even if the action may be known, e.g., [33]), which makes such direct comparison infeasible and unfair., and for a fair comparison with the baselines, we use the following two measures, similar to those in [33]. We also provide relevant bounds computed on real videos for reference. Note that arrows indicate whether higher or lower scores are better.

(1) Relative Motion Consistency Score (MCS ↓): The difference between consecutive frames captures the moving components, and hence the motion, of a video. First, each frame of the generated videos, as well as of the ground-truth data, is represented as a feature vector computed at the relu3_3 layer of a VGG16 network [31] pretrained on ImageNet [25]. Second, the averaged consecutive frame-feature difference vector is computed for both sets of videos, denoted f_g and f_r respectively. The relative MCS is then computed from f_g and f_r.

(2) Frame Consistency Score (FCS ↑): This score measures the consistency of the static portion of the generated video frames. We keep the first frame of the generated video as reference and compute the averaged structural similarity measure over all frames. The FCS is then the average of this measure over all videos.
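A simplified sketch of the motion consistency computation is given below; it operates on raw pixels instead of VGG16 features, and the relative-difference normalization is our assumption, since the excerpt does not give the exact formula:

```python
import numpy as np

def motion_vector(video):
    """Averaged consecutive-frame difference of a video.

    The paper computes this on relu3_3 VGG16 features; raw pixels are
    used here purely for illustration.
    """
    diffs = video[1:] - video[:-1]        # per-step motion
    return diffs.mean(axis=0).ravel()     # average over time, flatten

def relative_mcs(generated, real):
    """Relative distance between generated and real motion vectors
    (normalization by the real motion norm is an assumption)."""
    f_g, f_r = motion_vector(generated), motion_vector(real)
    return np.linalg.norm(f_g - f_r) / (np.linalg.norm(f_r) + 1e-8)
```

Lower values mean the generated videos move like the real ones; a video set identical to the reference scores exactly zero.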
Fig. 5 shows randomly selected frames of videos generated by the proposed method and by the adversarial approaches MoCoGAN [33] and VGAN [34]. For the ChairCAD [2] and Weizmann Human Action [9] datasets, the proposed method generates visually good-quality videos with a non-adversarial training protocol, whereas MoCoGAN produces blurry and inconsistent frames. Since we use optimized latent vectors, unlike MoCoGAN (which uses random latent vectors for video generation), our method produces visually more appealing videos. Fig. 5 illustrates two particularly important points. As visualized for the ChairCAD videos, the adversarial approach of MoCoGAN not only produces blurred chair images in the generated videos, but the videos are also non-uniform in quality. Further, as the time step increases, MoCoGAN tends to generate the same chair for different videos. This exposes a major drawback of adversarial approaches: they fail to learn the diversity of the data distribution. Our approach overcomes this by producing an optimized dictionary of latent vectors, which can be used to easily generate any video in the data distribution. To further validate our method qualitatively, we present the following experiments.
Fig. 4 qualitatively shows the contribution of specific parts of the proposed method on ChairCAD [2]. First, we investigate the impact of input latent vector optimization. For a fair comparison, we optimize the model for the same number of epochs. It can be observed that the model benefits from the joint optimization of the input latent space, producing better visual results. Next, we validate the contribution of the static loss and the triplet loss on a difficult video example whose chair color matches the background. Our method, combined with these two losses, is able to distinguish between the white background and the white body of the chair model.
| Actions | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 |
|---|---|---|---|---|---|---|---|---|---|
| run | | | | | | | | | |
| walk | | | | | | | | | |
| jump | | | | | | | | | |
| skip | | | | | | | | | |
Our non-adversarial approach can effectively separate the static and transient portions of a video, and can generate videos unseen during training. To validate these points, we pose a simple matrix-completion task over the combinations of identities and actions in the Weizmann Human Action [9] dataset. To train our model, we created a set of videos (without any cropping, to preserve the full scale of the frame) corresponding to the marked cells in Tab. 3. The unseen videos correspond to the unmarked cells. During testing, we randomly generated these unseen videos; the visual results are shown in Fig. 6. This experiment validates our claim of static (identity) and transient (action) disentanglement of a video, and of the generation of unseen videos using combinations of actions and identities that were not part of the training set.
To show that our methodology can be employed for frame interpolation, we trained our model using the rank-constrained loss in (7) for learning the static and transient latent vectors. During testing, we generated intermediate frames by interpolating the learned latent variables of two distinct frames. For this, we computed the difference between the learned transient latent vectors of the second and fifth frames, and generated the unseen in-between frames from the interpolated transient vectors after concatenating each with the static vector. Fig. 7 shows the results of interpolation between the second and fifth frames for two randomly chosen videos. Our method produces dynamically consistent frames with respect to the reference frames without any pixel clues.
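The interpolation step can be sketched as plain linear interpolation between two learned transient codes; each interpolated code would then be concatenated with the static code and passed through the generator (omitted here) to render an in-between frame:

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps):
    """Linearly interpolate between two learned transient codes.

    Returns `steps` codes from z_a to z_b inclusive; the low-rank
    constraint during training is what makes these in-between codes
    decode to plausible in-between frames.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - a) * z_a + a * z_b for a in alphas])
```

Without the rank constraint the straight line between two codes may leave the region the generator was trained on, which is exactly the failure mode the constraint in (7) is meant to prevent.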
Quantitative comparisons against the baselines are provided in Tab. 2. Compared to videos generated by the adversarial method MoCoGAN [33], we report a relative decrease of 19.22% in MCS and a relative increase of 4.70% in FCS on the ChairCAD dataset [2]. On the Weizmann Human Action [9] dataset, the proposed method achieves a relative decrease of 22.87% in MCS and a relative increase of 4.61% in FCS. Similarly, on the Golf scene dataset [34], we perform competitively with VGAN [34], with an observed relative decrease of 24.90% in MCS. An important conclusion from these results is that our proposed method, being non-adversarial in nature, learns to synthesize a diverse set of videos and performs on par with adversarial approaches. It should be noted that a better choice of reconstruction loss would likely produce stronger results; we leave this for future work.


In this section, we demonstrate the contribution of the different components of our proposed methodology on the ChairCAD [2] dataset. For all experiments, we randomly generate 500 videos with our model using the learned latent vector dictionary. We divide the ablation study into two parts. First, we study the impact of the learned latent vectors on the network modules: we generate videos once with the learned latent vectors, and once with latent vectors randomly sampled from a different distribution. The interdependency of our model weights and the learned latent vectors can be seen in Tab. 3(a): there is a relative decrease of 16.16% in MCS, from 3.96 to 3.32, and a relative increase of 18.66% in FCS. This shows that optimizing the latent space is important for good-quality video generation in the proposed method.
Second, we investigate the impact of the proposed losses. Specifically, we examine the four possible combinations of the static loss and the triplet loss. The results are presented in Tab. 3(b). It can be observed that the combination of the triplet loss and the static loss provides the best result when both are employed together, indicated by the relative decrease of 14.26% in MCS, from 3.83 to 3.32.
We presented a non-adversarial approach for synthesizing videos by jointly optimizing both the network weights and the input latent space. Specifically, our model consists of a global static latent variable for content features, a frame-specific transient latent variable, a deep convolutional generator, and a recurrent neural network, all trained using a regression-based reconstruction loss together with a triplet-based loss. Our approach allows us to generate a diverse set of videos of almost uniform quality, perform frame interpolation, and generate videos unseen during training. Experiments on three standard datasets show the efficacy of our approach over state-of-the-art methods.
Acknowledgements. The work was partially supported by NSF grant 1664172 and ONR grant N000141912264.
Proceedings of the International Conference on Machine Learning, pp. 214–223.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3762–3769.
Implicit Maximum Likelihood Estimation. arXiv preprint arXiv:1809.09087.
Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. arXiv preprint arXiv:1605.08104.
Temporal Generative Adversarial Nets with Singular Value Clipping. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2830–2839.
FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
Time-Contrastive Networks: Self-Supervised Learning from Video. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1134–1141.

ChairCAD [2]. This dataset provides 1,393 chair CAD models. Each model's frame sequence is produced using two elevation angles in addition to thirty-one azimuth angles. All the chair models are rendered at a fixed distance from the camera. The authors provide four video sequences per CAD model. We choose the first 16 frames of each video for our paper, and consider the complete dataset as one class.
Weizmann Human Action [9]. This dataset is a collection of 90 video sequences showing nine different identities performing 10 different actions: run, walk, skip, jumping-jack ('jack'), jump-forward-on-two-legs ('jump'), jump-in-place-on-two-legs ('pjump'), gallop-sideways ('side'), wave-two-hands ('wave2'), wave-one-hand ('wave1'), and bend. We randomly choose 16 consecutive frames of every video in each training iteration.
Golf Scene [34]. [34] released a dataset containing 35 million clips (32 frames each), stabilized using SIFT+RANSAC. It contains several categories filtered by a pretrained Places-CNN model, one of them being golf scenes. The Golf scene dataset contains 20,268 golf videos. Because many non-golf videos are part of the golf category (due to inaccurate labels), this dataset presents a particularly challenging data distribution for our proposed method. For a fair comparison, we selected our training set from this dataset so that the videos pertain to golf action as closely as possible, and trained the VGAN [34] model on these selected videos.
We used PyTorch [22] for our implementation. The Adam optimizer [13] was used to update the model weights, and the SGD optimizer [24], with momentum, was used to update the latent spaces. The corresponding learning rates for the generator, the RNN, and the latent spaces were set to the values indicated in Tab. 6.
Hyperparameters. Methods such as [33, 34, 26] that generate videos from latent priors have no dataset split, as the task is to synthesize high-quality videos from the data distribution and then evaluate the model's performance. All hyperparameters (except the static and transient latent dimensions d_s and d_t) are set as described in [29, 28, 33, 35] (e.g., the triplet margin from [29]). For d_s and d_t, we follow the strategy used in Sec. 4.3 of [33] and observe that our model generates videos with good visual quality (FCS) and plausible motion (MCS) for ChairCAD when (d_s, d_t) = (206, 50). The same strategy is used for all datasets. The hyperparameters employed for each dataset used in this paper are given in Tab. 6: the learning rates for the generator (with weights θ_G), the RNN (with weights θ_R), and the latent spaces; the number of epochs; the static and transient latent dimensions d_s and d_t; the static and triplet loss regularization constants λ_s and λ_t; the triplet loss margin δ; and the number of levels of the Laplacian pyramid representation used in L_r and L_s.
| Datasets | Hyperparameter values |
|---|---|
| ChairCAD [2] | 206, 50, 0.01, 0.01, 2, 12.5, 5, 300, 4 |
| Weizmann Human Action [9] | 56, 200, 0.01, 0.1, 2, 12.5, 5, 700, 3 |
| Golf Scene [34] | 56, 200, 0.01, 0.01, 2, 0.1, 0.1, 12.5, 10, 1000, 4 |
Other details. We performed all our experiments on a system with a 48-core Intel(R) Xeon(R) Gold 6126 processor and 256 GB of RAM. We used an NVIDIA GeForce RTX 2080 Ti for all GPU computations during training, and NVIDIA Tesla K40 GPUs for computing all evaluation metrics. All our implementations are non-optimized PyTorch code. Our runtime analysis showed that it took one to two days on average to train the model and obtain the learned latent vectors.
In this section, we provide more qualitative results of videos synthesized using our proposed approach on each dataset (Fig. 9 for the ChairCAD [2] dataset, Fig. 10 for the Weizmann Human Action [9] dataset, and Fig. 11 for the Golf scene [34] dataset). We also provide more examples of the interpolation experiment in Fig. 12.