Towards High Resolution Video Generation with Progressive Growing of Sliced Wasserstein GANs

by   Dinesh Acharya, et al.

The extension of image generation to video generation turns out to be a very difficult task, since the temporal dimension of videos introduces an extra challenge during the generation process. Besides, due to limitations of memory and training stability, generation becomes increasingly challenging as the resolution and duration of videos increase. In this work, we exploit the idea of progressive growing of Generative Adversarial Networks (GANs) for higher resolution video generation. In particular, we begin by producing video samples of low resolution and short duration, and then progressively increase both resolution and duration, either separately or jointly, by adding new spatiotemporal convolutional layers to the current networks. Starting from learning a very coarse spatial appearance and temporal movement of the video distribution, the proposed progressive method learns spatiotemporal information incrementally to generate higher resolution videos. Furthermore, we introduce a sliced version of the Wasserstein GAN (SWGAN) loss to improve distribution learning on high-dimensional video data with a mixed spatiotemporal distribution. The SWGAN loss replaces the distance between joint distributions with distances between one-dimensional marginal distributions, making the loss easier to compute. We evaluate the proposed model on our collected face video dataset of 10,900 videos to generate photorealistic face videos of 256x256x32 resolution. In addition, our model reaches a record Inception score of 14.57 on the unsupervised action recognition dataset UCF-101.




1.1 Focus of this Work

The main focus of this work is the unsupervised generation of higher resolution videos. Such an endeavour has several challenges. First, there is a lack of sufficiently large high-resolution video datasets. Next, generating larger videos incurs severe memory and computational constraints, and network training and convergence are highly unstable. We address these issues with the following contributions:

  • Progressive growing of video generative adversarial networks

  • Improved loss function for better stability

  • Novel facial dynamics dataset with video clips

1.2 Thesis Organization

In chapter 2, readers are introduced to the basic ideas behind GANs and recent advances for their stable training. In chapter 3, relevant literature is reviewed and important models are discussed in detail. In chapter 4, the proposed techniques for spatio-temporal growing of GANs are discussed. The Sliced Wasserstein GAN loss for stable training is presented in chapter 5. In chapter 6, the various metrics used in this work for evaluating and comparing our model against other models are presented. In chapter 7, we discuss the novel dataset collected for this work and also mention the additional datasets used for evaluation. Qualitative and quantitative comparisons of our model with existing models are presented in chapter 8. In chapter 9, we discuss our findings and conclusions.

2.1 GAN

Figure 2.1: Standard GAN Architecture.

Generative Adversarial Networks (GANs) [10] are unsupervised generative models that learn to generate samples from a given distribution in an adversarial manner. The network architecture consists of a generator and a discriminator (in some cases also called a critic). Given random noise z \in \mathbb{R}^d as input, where d is the dimension of the latent space, the generator G tries to generate samples from the given distribution. The discriminator D tries to distinguish whether a sample comes from the data distribution or from the generator. The loss function is designed so that the generator's goal is to generate samples that fool the discriminator, while the discriminator's goal is to avoid being fooled by the generator. As such, GANs can be interpreted as non-cooperative games. The loss function proposed in [10] is given by:

\mathcal{L}(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]    (2.1)

where z is the latent code, x is a data sample, p_z is the probability distribution over the latent space and p_{\mathrm{data}} is the probability distribution over data samples. The two-player minimax game is then given by:

\min_G \max_D \mathcal{L}(G, D)    (2.2)
In early years, discriminators were trained with a sigmoid cross-entropy loss, since they acted as classifiers between real and generated data. However, it has been argued that such GANs suffer from the vanishing gradient problem. Instead, least squares GANs [28] were proposed, which train the discriminator with a least squares loss in place of the sigmoid cross-entropy loss.

Despite some improvements from the least squares loss [28], GANs still suffered from several issues such as training instability, mode collapse and lack of convergence. In [35], the authors proposed various techniques such as feature matching and minibatch discrimination. In feature matching, activations of intermediate layers of the discriminator are used to guide the generator. Formally, the new generator loss is given by:

\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f(x) - \mathbb{E}_{z \sim p_z} f(G(z)) \right\|_2^2    (2.3)

where the feature extractor f, built from intermediate discriminator activations, replaces the traditional discriminator output D. The discriminator itself is trained as usual. The minibatch discrimination technique was proposed to address the issue of mode collapse. To accomplish this, additional statistics modelling the affinity of samples within a minibatch are concatenated to features in the discriminator. It is important to note that these summary statistics are learned during training through large projection matrices.
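As a concrete illustration, the feature-matching objective of Eq. 2.3 reduces to a squared distance between batch-mean features. The following numpy sketch uses illustrative names (`feature_matching_loss`, toy feature batches) that are ours, not part of the original method:

```python
import numpy as np

def feature_matching_loss(f_real, f_fake):
    # Squared L2 distance between the mean intermediate discriminator
    # features of a real minibatch and a generated minibatch.
    return float(np.sum((f_real.mean(axis=0) - f_fake.mean(axis=0)) ** 2))

# Toy batches of 4 samples with 16-dimensional "discriminator features".
rng = np.random.default_rng(0)
f_real = rng.normal(size=(4, 16))
f_fake = rng.normal(size=(4, 16))
loss = feature_matching_loss(f_real, f_fake)
```

Matching only the first moment of the features is what makes this objective weaker, but more stable, than directly fooling the discriminator.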

2.2 Wasserstein GAN

In [2], the authors argue that optimizing the GAN loss given in Eq. 2.1 amounts to minimizing the Jensen-Shannon (JS) divergence between the distributions of generated and real samples. When the two distributions have non-overlapping support, the JS-divergence can jump discontinuously, which leads to the aforementioned optimization issues. For stable training of GANs, the authors propose to minimize the Wasserstein Distance (WD) between the distributions, which behaves smoothly even in the case of non-overlapping support. Formally, the primal form of the WD is given by:

W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]    (2.4)

where p_r and p_g are the distributions of real and generated samples and \Pi(p_r, p_g) is the space of all possible joint probability distributions with marginals p_r and p_g. It is not feasible to explore all possible values of \gamma, so the authors propose to use the dual formulation, which is better suited for approximation. Formally, the dual formulation of Eq. 2.4 is given by:

W(p_r, p_g) = \sup_{\lVert f \rVert_L \le 1} \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]    (2.5)

where \lVert f \rVert_L \le 1 is the Lipschitz constraint. The GAN loss is then given by:

\min_G \max_{w \in \mathcal{W}} \mathbb{E}_{x \sim p_r}[D_w(x)] - \mathbb{E}_{z \sim p_z}[D_w(G(z))]    (2.6)

Here, the discriminator D_w takes the form of a feature extractor parametrized by w, and is further required to be K-Lipschitz. In order to enforce the K-Lipschitz constraint, the authors proposed weight clipping. However, as argued in the same paper, weight clipping is a very rudimentary technique for enforcing the Lipschitz constraint. In [12], the authors propose instead to penalize the norm of the discriminator's gradient. In particular, the new loss is defined as:

\mathcal{L} = \mathbb{E}_{z \sim p_z}[D(G(z))] - \mathbb{E}_{x \sim p_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]    (2.7)

where \lambda is the regularization parameter and \hat{x} is sampled uniformly from the straight line connecting a real sample and a generated sample.
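The interpolation step and penalty term of Eq. 2.7 can be illustrated with a toy linear critic D(x) = w·x, whose gradient with respect to the input is simply w, so the penalty is available in closed form. All variable names here are ours and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic D(x) = w @ x; its input gradient is exactly w.
w = rng.normal(size=4)
x_real = rng.normal(size=4)
x_fake = rng.normal(size=4)

# Sample x_hat uniformly on the straight line between real and fake samples.
eps = rng.uniform()
x_hat = eps * x_real + (1.0 - eps) * x_fake

lam = 10.0                        # regularization parameter lambda
grad_norm = np.linalg.norm(w)     # ||grad_x D(x_hat)||_2 for the linear critic
penalty = lam * (grad_norm - 1.0) ** 2
```

In a real implementation the gradient is obtained by automatic differentiation of the critic at x_hat; the linear critic is only a device to make the penalty computable without an autograd framework.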

2.3 Conditional GANs

It is relevant to briefly review Conditional GANs [30]. They provide a framework for enforcing conditions on the generator so that it produces samples with a desired property. Such conditions could be class labels or some portion of the original data, as in the case of future prediction. Under this setting, the GAN loss defined in Eq. 2.1 becomes

\mathcal{L}(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z \mid y)))]    (2.8)

where y is the conditioning variable. Concrete applications of Conditional GANs include the generation of specific digits of the MNIST dataset [30], of a face with a specific facial expression, age or complexion in the context of images [6], or future prediction [23] in the context of videos, to name a few.

Over the years, GANs have found applications in numerous areas, including image-to-image translation [52][17][53][6], 3D object generation [46][47], super resolution, image inpainting [44], etc.

3.1 Progressive Growing of GAN

As mentioned earlier, the basic idea behind progressive growing of GANs is to gradually increase the complexity of the problem [19]. To accomplish this, the authors propose to first train the generator and discriminator on lower resolution samples, and then progressively introduce new layers during training so that the networks learn the increasingly complex problem of generating higher resolution images.

Figure 3.1: Transition phase during growing of generator and discriminator networks under the progressive growing scheme.

As illustrated in Fig. 3.1, new network layers are successively introduced to the generator and discriminator. The transition from one stage of training to the next is made smooth by linear interpolation during a transition phase, with the interpolation factor changed smoothly during training. At every stage, the output layer of the generator consists of 1x1 convolutions that map feature channels to an RGB image; similarly, the first layer of the discriminator consists of 1x1 convolutions that map the RGB image to feature channels. During the transition step, a linear interpolation between the outputs of the 1x1 convolutions applied to the lower resolution feature channels and to the higher resolution feature channels is taken as the output of the generator. The scalar factor corresponding to the output of the higher resolution feature channels is smoothly increased from 0 to 1. Similarly, during the transition, both higher resolution and downscaled images are provided as input to different input layers of the discriminator. Learning on a simpler problem first and gradually increasing its complexity for both discriminator and generator can be expected to lead to faster convergence and stability. The authors claim that the improvement from the proposed training scheme is orthogonal to improvements arising from loss functions. The idea of progressive growing had not previously been applied to video generation; in this work, we explore progressive growing of video generative networks.

3.2 Video GAN

In [43], the authors propose a parallel architecture for unsupervised video generation. The generator consists of two parallel streams with 2D and 3D convolution layers, while the discriminator is a single stream of 3D convolution layers. As illustrated in Figure 3.2, the two-stream architecture was designed to untangle foreground and background in videos.

Figure 3.2: Two stream architecture of the generator of Video GAN. Video GAN assumes stable background in the video clips.

If m(z) is the mask that selects either foreground or background, the output of the generator G(z) at each pixel is given by:

G(z) = m(z) \odot f(z) + (1 - m(z)) \odot b(z)

where b(z) is the output of the background stream and f(z) is the output of the foreground stream. In the background stream, the same value of b(z) is replicated over all time frames. The experimental results presented by the authors support the use of the two-stream architecture. However, one of the model's strong assumptions is that of a static background.
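The masked combination of the two streams is a simple pointwise blend, sketched below in numpy with illustrative names and a soft mask; the actual streams are of course deep networks rather than random tensors:

```python
import numpy as np

def combine_streams(mask, foreground, background):
    # mask, foreground: (T, H, W, C); background: a single (H, W, C) frame
    # replicated over all T time steps, as in the background stream.
    bg = np.broadcast_to(background, foreground.shape)
    return mask * foreground + (1.0 - mask) * bg

T, H, W, C = 8, 16, 16, 3
rng = np.random.default_rng(0)
mask = rng.uniform(size=(T, H, W, 1))   # soft foreground mask m(z)
fg = rng.uniform(size=(T, H, W, C))     # foreground stream f(z)
bg = rng.uniform(size=(H, W, C))        # static background b(z)
video = combine_streams(mask, fg, bg)
```

Broadcasting the single background frame over time is exactly what encodes the static-background assumption criticized above.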

3.3 Temporal GAN

In [34], the authors propose a cascade architecture for unsupervised video generation. As illustrated in Fig. 3.3, the proposed architecture consists of a temporal generator and an image generator. The temporal generator, which consists of 1-D deconvolution layers, maps the input latent code to a set of new latent codes corresponding to the frames of the video. Each new latent code, together with the original latent code, is then fed to the image generator. The resulting frames are concatenated to obtain a video. For the discriminator, TGAN uses a single stream of 3D convolution layers.

Figure 3.3: Temporal GAN uses a cascade architecture to generate videos. Temporal generator uses 1-D deconvolutions and spatial generator uses 2-D deconvolutions.

Unlike Video GAN, this model makes no assumption about the separation of background and foreground; accordingly, no background stabilization of the videos is required.
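The cascade structure can be sketched in a few lines: one latent code per frame from a temporal stage, each fed together with the original code to an image stage. Both stages below are trivial stand-ins (our own illustrative functions), not TGAN's deconvolution stacks:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_generator(z0, num_frames):
    # Stand-in for the 1-D deconvolution stack: derive one latent code
    # per frame from the original latent code z0.
    return [np.tanh(z0 + 0.1 * t) for t in range(num_frames)]

def image_generator(z0, zt):
    # Stand-in image generator that consumes both the original latent
    # code and the per-frame code; here it produces an (8, 8) "frame".
    return np.outer(z0, zt)

z0 = rng.normal(size=8)                  # original latent code
frames = [image_generator(z0, zt) for zt in temporal_generator(z0, 16)]
video = np.stack(frames)                 # frames concatenated along time
```

The important structural point is that the image stage is applied independently per frame; temporal coherence comes entirely from the temporal stage.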

3.4 MoCoGAN

The Motion Content GAN (MoCoGAN) architecture is similar to that of Temporal GAN (TGAN) [34] in the sense that it also has a cascade architecture: it too uses temporal and image generators and a 3D-convolution-based discriminator. However, unlike in TGAN, the temporal generator in MoCoGAN is based on a Recurrent Neural Network (RNN), and its input is a set of latent variables. Furthermore, the outputs of the temporal generator, the motion codes, are concatenated with a newly sampled content code before being fed to the image generator. The discriminator is complemented by an additional image-based discriminator. The authors claim that this architecture helps to separate motion and content in videos.

3.5 Other Related Works

Video Pixel Networks (VPN) [18] build on the work of PixelCNNs [41] for future prediction. In particular, they estimate the probability distribution of raw pixel values in video using resolution-preserving CNN encoders and PixelCNN decoders. Other works on future prediction include [29]. Recently, optical-flow-based models have produced more realistic results; in particular, in [31] the authors use flow and texture GANs to model optical flow and texture in videos. Several works have also focused on future prediction, which is a slightly different problem from unsupervised video generation [23][29].

4.1 Transition Phase

Figure 4.2: Transition phase during which new layers are introduced to both generator and discriminator.

During each phase, the final layer of the generator consists of convolution filters that map input feature channels to RGB videos; the discriminator, in a similar fashion, begins with convolution filters that map input RGB videos to feature channels. While transitioning from one resolution to the next, new convolution layers are introduced symmetrically to both discriminator and generator to generate larger resolution videos. During the transition from one level of detail to the next, the generator outputs videos at two different resolutions. The lower resolution videos are upscaled with nearest-neighbour upscaling, and a linear combination of the upscaled video and the higher resolution video is fed to the discriminator. The weight corresponding to the higher resolution video is smoothly increased from 0 to 1, while the weight corresponding to the upscaled video is gradually decreased from 1 to 0. New layers are introduced in the discriminator in the same manner.
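The fade-in described above is a plain linear blend between the nearest-neighbour-upscaled low-resolution output and the new high-resolution branch. A minimal numpy sketch, with our own function names and a (C, T, H, W) layout assumed for illustration:

```python
import numpy as np

def nn_upscale(video):
    # Nearest-neighbour 2x spatial upscaling of a (C, T, H, W) video.
    return video.repeat(2, axis=2).repeat(2, axis=3)

def fade_in(low_res, high_res, alpha):
    # Linear blend used during the transition phase; the weight alpha of
    # the new high-resolution branch rises smoothly from 0 to 1.
    return (1.0 - alpha) * nn_upscale(low_res) + alpha * high_res

rng = np.random.default_rng(0)
low = rng.uniform(size=(3, 8, 16, 16))    # old-resolution generator output
high = rng.uniform(size=(3, 8, 32, 32))   # new-resolution branch output
blended = fade_in(low, high, alpha=0.3)
```

At alpha = 0 the network behaves exactly like the previous stage, and at alpha = 1 the old branch has no influence, which is what makes the growth step smooth.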

4.2 Minibatch Standard Deviation

One way to avoid mode collapse is to use feature statistics of different samples within a minibatch and penalize the closeness of those features [35]. In the minibatch discrimination approach, the feature statistics are learned through the parameters of projection matrices that summarize input activations. Instead, following [19], the standard deviation of each feature at each spatio-temporal location is computed across the minibatch and then averaged. The single summary statistic thus obtained is concatenated, as an extra feature map, at every spatio-temporal location for all samples of the minibatch.

Figure 4.3:

Illustration of different steps of minibatch standard deviation layer: (a) feature vectors at each pixel across the minibatch, (b) standard deviation computation of each feature vector, (c) average operation, (d) replication and (e) concatenation.

Since there are no additional learnable parameters, this approach is computationally cheap and yet, as argued in [19], effective.
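The steps of Fig. 4.3 amount to one reduction and one broadcast. The sketch below assumes a (N, C, T, H, W) activation layout; the function name is ours:

```python
import numpy as np

def minibatch_stddev(x):
    # x: (N, C, T, H, W) activations of a minibatch of N samples.
    std = x.std(axis=0)     # per-feature, per-location stddev across the batch
    stat = std.mean()       # averaged into a single summary scalar
    # Replicate the scalar into one extra feature map for every sample.
    extra = np.full((x.shape[0], 1) + x.shape[2:], stat)
    return np.concatenate([x, extra], axis=1)

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8, 2, 5, 5))
out = minibatch_stddev(acts)    # (4, 9, 2, 5, 5): one extra channel
```

If the generator collapses to near-identical samples, the appended channel is near zero, giving the discriminator a direct signal of mode collapse.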

4.3 Pixel Normalization

Following [19], and in the spirit of the local response normalization proposed in [24], normalizing the feature vector at each pixel avoids the explosion of parameters in the generator and discriminator. The pixel feature vector normalization proposed in [19] extends naturally to the spatio-temporal case. In particular, if a_{x,y,t} and b_{x,y,t} are the original and normalized feature vectors at the pixel with spatial position (x, y) and temporal position t, then

b_{x,y,t} = \frac{a_{x,y,t}}{\sqrt{\frac{1}{N} \sum_{j=0}^{N-1} \left(a^j_{x,y,t}\right)^2 + \epsilon}}

where \epsilon is a small constant and N is the number of feature maps. Though pixel vector normalization may not necessarily improve performance, it does avoid the explosion of parameters in the network.
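The normalization above is a one-liner over the channel axis. A minimal sketch, assuming a (C, T, H, W) layout and our own function name:

```python
import numpy as np

def pixel_norm(a, eps=1e-8):
    # a: (C, T, H, W). Divide the feature vector at every spatio-temporal
    # location by the root mean square over the C feature maps.
    return a / np.sqrt((a ** 2).mean(axis=0, keepdims=True) + eps)

rng = np.random.default_rng(0)
a = rng.normal(size=(16, 4, 8, 8))
b = pixel_norm(a)
```

After the operation, the mean square over channels is 1 at every location, which is what bounds the magnitude of activations regardless of how large the convolution weights grow.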

6.1 Inception Score

The Inception score was originally proposed in [35] for the evaluation of GANs. In the paper, the authors argued that the Inception score correlates well with the visual quality of generated samples. Let x = G(z) be a sample generated by the generator G, let p(y | x) be the distribution of class labels for a generated sample, and let p(y) be the marginal class distribution:

p(y) = \int_z p\left(y \mid x = G(z)\right) dz

The Inception score is defined as:

\mathrm{IS}(G) = \exp\left(\mathbb{E}_{x \sim p_g}\left[D_{\mathrm{KL}}\left(p(y \mid x) \,\|\, p(y)\right)\right]\right)

where D_{\mathrm{KL}} is the Kullback-Leibler divergence between p(y | x) and p(y).

In practice, the marginal class distribution is approximated with:

p(y) \approx \frac{1}{N} \sum_{i=1}^{N} p\left(y \mid x_i = G(z_i)\right)

where N is the number of samples generated.

Intuitively, the maximum Inception score is obtained when each generated sample can be clearly classified as belonging to one of the classes of the training set and the distribution of samples across the different classes is as uniform as possible. This encourages realistic samples and discourages mode collapse. The idea behind the Inception score has been generalized to the context of video generation: in [34], the authors propose to use a C3D model trained on the Sports-1M dataset and fine-tuned on the UCF101 dataset. It is important to point out that Inception score computation requires a model trained on a specific classification problem and the corresponding data. Furthermore, the Inception score does not compare the statistics of generated samples directly with the statistics of real samples [13].
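Given the matrix of class posteriors for a batch of generated samples, the score follows directly from the definitions above. A numpy sketch with our own naming; the two toy inputs illustrate the two extremes discussed in the text:

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    # p_yx: (N, K) array of class posteriors p(y|x_i) for N generated samples.
    p_y = p_yx.mean(axis=0, keepdims=True)   # empirical marginal p(y)
    # Mean KL divergence between each conditional and the marginal.
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident classifier covering all 10 classes uniformly: score approaches K.
confident = np.eye(10)[np.arange(100) % 10]
# Maximally uncertain classifier: score is 1, the minimum.
uncertain = np.full((100, 10), 0.1)
```

In practice p(y|x) comes from a pretrained classifier (Inception for images, C3D for videos); the formula itself is classifier-agnostic.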

6.2 Fréchet Inception Distance

An alternative measure to assess the quality of generated samples was proposed in [13]. In the paper, the authors propose to use pre-trained networks as feature extractors to extract low-level features from both real and generated samples. Let \phi be the CNN used to extract features, let (m_r, C_r) be the mean and covariance of the features extracted from real samples, and let (m_g, C_g) be the mean and covariance of the features extracted from generated samples x = G(z). The Fréchet distance is then defined as

d^2\left((m_r, C_r), (m_g, C_g)\right) = \lVert m_r - m_g \rVert_2^2 + \mathrm{Tr}\left(C_r + C_g - 2\left(C_r C_g\right)^{1/2}\right)
FID was shown to correlate well with visual perception quality [13]. Since FID directly compares the summary statistics of generated and real samples, it can be considered more accurate than the Inception score. Furthermore, as lower level features are used to compute it, the FID score can be used to evaluate generative models on any dataset.

Similar to the Inception score, FID can be generalized to compare video generative models. In particular, as C3D is a standard model widely used in video recognition tasks, a C3D model trained on an action recognition dataset can be used as the feature extractor. Since the output of the final pooling layer of C3D is very high-dimensional, the output of the first fully connected layer can be used to reasonably compare the quality of generated samples.
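The only delicate part of the Fréchet distance is the matrix square root. Since Tr((C_r C_g)^{1/2}) equals Tr((C_r^{1/2} C_g C_r^{1/2})^{1/2}) for positive semi-definite covariances, plain symmetric eigensolvers suffice. A numpy sketch under that assumption, with our own function name:

```python
import numpy as np

def frechet_distance(m_r, c_r, m_g, c_g):
    # ||m_r - m_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2}), with the trace
    # of the square root computed via the symmetric form C_r^{1/2} C_g C_r^{1/2}.
    vals_r, vecs_r = np.linalg.eigh(c_r)
    sqrt_r = vecs_r @ np.diag(np.sqrt(np.clip(vals_r, 0.0, None))) @ vecs_r.T
    inner_vals = np.linalg.eigvalsh(sqrt_r @ c_g @ sqrt_r)
    tr_sqrt = np.sqrt(np.clip(inner_vals, 0.0, None)).sum()
    diff = m_r - m_g
    return float(diff @ diff + np.trace(c_r) + np.trace(c_g) - 2.0 * tr_sqrt)
```

With identical means and covariances the distance is zero; with identity covariances it reduces to the squared distance between the means.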

7.1 Trailer Face Dataset

Figure 7.1: Pipeline for Construction of Dataset.

A large share of GAN papers evaluate and compare generative models on facial datasets [19][13][10], such as CelebA [26] in the case of images and the MUG dataset [1] or YouTube Faces [45] in the case of videos [40][16]. However, there is a lack of publicly available high resolution datasets containing facial dynamics.

Dataset Resolution (Aligned) Sequences Wild Labels Diverse Dynamics
TrailerFaces 300x300 10,911
YoutubeFaces 100x100 3,425 Identity
AFEW - 1,426 Expressions
MUG 896x896 1,462 Expressions
Table 7.1: Comparison of our TrailerFaces dataset with existing datasets containing facial dynamics.

In terms of resolution, widely used video generation datasets such as the Golf and Aeroplane datasets are only 128x128. UCF101 is widely used for the evaluation of generative models; though it contains 240x320 resolution samples, the relatively small number of samples per class makes learning meaningful features difficult. The Golf and Aeroplane datasets, on the other hand, contain very diverse videos, and learning a meaningful representation from them can be difficult for networks. Hence, a novel dataset of human facial dynamics was collected from movie trailers.

Number of Frames 30-33 34-39 40-47 48-57 58-69 70-423
Total clips 1781 3106 2291 1591 940 1201
Table 7.2: Total number of clips with given number of frames.

Our choice of movie trailers for dataset collection was motivated by the fact that trailers highlight dramatic and emotionally charged scenes. Unlike whole movies, interviews or TV series, trailers contain scenes in which stronger emotional responses of actors are highlighted. Furthermore, using trailers of thousands of movies increases the gender, racial and age-wise diversity of the faces in the clips. Complete Hollywood movie trailers were downloaded from YouTube, and the number of SIFT feature matches between corresponding frames was used for shot boundary detection. After splitting the trailers into the detected shots, those with too few or too many frames were removed. Face detection was then carried out to filter out clips in which no run of at least 31 consecutive frames contains a face; the Haar-cascade face detector from OpenCV was used. After detection of faces, the Deep Alignment Network [22] was used to extract facial landmarks, which were then used for alignment via a similarity transform. This was observed to be more stable across the temporal dimension than state-of-the-art techniques like MTCNN. Finally, consecutive frames from shots with successful face detection were selected, and SIFT feature matching was again used to remove clips containing different personalities across frames.
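The core of the shot-boundary step is to cut wherever inter-frame similarity drops sharply. The pipeline counts SIFT feature matches; the sketch below substitutes normalized correlation as a dependency-free stand-in for the similarity score, so the function and threshold are illustrative only:

```python
import numpy as np

def shot_boundaries(frames, threshold=0.5):
    # Cut wherever the similarity between consecutive frames drops below
    # the threshold. The real pipeline scores similarity by SIFT match
    # counts; normalized correlation stands in for it here.
    cuts = []
    for i in range(1, len(frames)):
        a = frames[i - 1].astype(float).ravel()
        b = frames[i].astype(float).ravel()
        a = (a - a.mean()) / (a.std() + 1e-8)
        b = (b - b.mean()) / (b.std() + 1e-8)
        if float(np.mean(a * b)) < threshold:   # correlation in [-1, 1]
            cuts.append(i)
    return cuts

# Two visually distinct "shots" of four frames each.
shot_a = [np.arange(16.0).reshape(4, 4)] * 4
shot_b = [np.arange(16.0).reshape(4, 4)[::-1]] * 4
cuts = shot_boundaries(shot_a + shot_b)
```

Feature matching is more robust than raw correlation under camera motion, which is why the actual pipeline uses SIFT; the control flow, however, is the same.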

Figure 7.2: Samples from TrailerFaces Dataset. Random frames were chosen for visualization.

7.2 UCF101 Dataset

The UCF101 dataset [36] was originally collected for action recognition tasks. It contains 13,320 videos from 101 different action categories, including Sky Diving, Knitting and Baseball Pitch. In [34], a video-based Inception score was proposed for evaluating the quality of video generative models. As argued in [34], Inception score computation requires a dataset with class labels and a standard model trained for classification. For training, the first training split of the UCF101 dataset, with 9,537 video samples, was used.

Figure 7.3: Samples from UCF101 dataset. Random frames were selected for visualization.

7.3 Golf and Aeroplane Datasets

The Golf and Aeroplane datasets contain 128x128 resolution videos that can be used for evaluating video generative adversarial networks. The Golf dataset in particular was used in [34][43][23]. Both datasets contain videos in the wild. The Golf dataset contains more than half a million clips; we used the background-stabilized clips for training our model. The Aeroplane dataset contains more than 300,000 clips that are not background stabilized.

Figure 7.4: Samples from Golf (right) and Aeroplane (left) datasets. Random frame was selected for visualization.

8.1 Qualitative Results

Figure 8.1: Improvement in resolution of generated videos over time on the TrailerFaces dataset. A single frame from each video clip was selected.

As discussed earlier, the progressive growing scheme was used to train the video generative networks. The improvement in quality and level of detail over the course of training is illustrated in Fig. 8.1: more detailed structures appear in the images as training proceeds, and the generated frames look reasonable on the TrailerFaces dataset. However, the quality is still not comparable to that of generated samples in the image setting [19]. As illustrated in Fig. 8.2, Fig. 8.3 and Fig. 8.4, the structure of moving objects such as aeroplanes, humans and animals is not distinct, and they appear as blobs. Though the appearance of dynamic objects is not well captured by the network, Fig. 8.2 and Fig. 8.3 suggest that the temporal dynamics are more reasonable.

To analyze whether the network has overfitted the dataset, we carried out linear interpolation in latent space and generated samples. As seen from Fig. B.1-B.6, samples from all datasets show that our network generalizes well and does not overfit.

Figure 8.2: Qualitative comparison of samples from the Aeroplane dataset generated by our method with those generated by Video GAN and Temporal GAN.
Figure 8.3: Qualitative comparison of samples from the Golf dataset generated by our method with those generated by Video GAN and Temporal GAN.
Figure 8.4: Qualitative comparison of clips generated with the progressive approach (top), Temporal GAN (bottom left) and Video GAN (bottom right) on the Aeroplane (left) and Golf (right) datasets.

It can be observed that the progressive growing technique can generate higher resolution videos without the mode collapse or instability issues traditionally suffered by GANs. However, the issue that moving objects appear as blobs in generated samples, as reported in [43], is still not completely solved.

8.2 Inception score on UCF101

The same protocol and C3D model proposed in [34] were used for the computation of Inception scores. All scores except our own were taken from [34] and [40]. To train our model, the central 32 frames of each UCF101 video were selected, and each frame was resized and cropped. In our experiments, a strong Inception score was already obtained with a single, though progressively grown, model; adding the SWGAN loss raised it to 14.56 ± 0.05. Both of these scores are the best results we are aware of.

Model Inception Score
Progressive Video GAN
Progressive Video GAN + SWGAN 14.56 ± 0.05
Maximum Possible
Table 8.1: Inception scores of Progressive Video GAN compared with other models on UCF101 dataset.

In all cases, the first training split of the UCF101 dataset was used. However, in [34] and [40], the authors randomly sampled consecutive frames during training; in our case, we restricted training to the central frames of each video.

Surprisingly, the Inception score started decreasing as the network was trained further. One possible cause could be the smaller minibatch size used at higher resolutions. However, further experiments are necessary to reach a decisive conclusion about this behaviour.

8.3 Fid

In this section, we compare the FID score of samples generated by our model with those generated by the VideoGAN and TGAN models. In the C3D model, the output of the fc-6 layer is 4096-dimensional, whereas the output of the pool-5 layer is 8192-dimensional; the fc-6 output was therefore used to compute the FID score for computational reasons. To compute the FID score, 10,000 samples were generated with each model. Since the VideoGAN and TGAN models were trained to generate 64x64 resolution videos, we upscaled their outputs to a common resolution before computing the FID score.

Figure 8.5: Comparison of our models on UCF101 dataset based on FID Score (left) and Inception Score (right).
Figure 8.6: Comparison of our model with TGAN and VideoGAN based on Golf and Aeroplane Datasets as measured by FID score.
Model FID Score on Golf Dataset FID Score on Aeroplane Dataset
VGAN[43] 113007 149094
TGAN[34] 112029 120417
Progressive Video GAN 95544 102049
Table 8.2: Quantitative comparison of Progressive Video GAN with TGAN and VideoGAN based on FID score on the Golf and Aeroplane datasets.

To report FID, TGAN and VideoGAN were trained by us using publicly available code. On both datasets, Progressive Video GAN clearly performs better than TGAN and VideoGAN. The difference is even more prominent on the Aeroplane dataset, where TGAN and Progressive Video GAN perform significantly better than VideoGAN. As mentioned earlier, the Golf dataset was background-stabilized whereas the Aeroplane dataset was not; this is readily explained by the fact that VideoGAN assumes a stable background while TGAN and Progressive Video GAN make no such assumption.

Appendix A Network Architecture

Generator Activation Output shape Parameters
Latent vector - 128x1x1x1 -
Fully-connected LReLU 8192x1x1x1 1.04m
Conv 3x3x3 LReLU 128x4x4x4 0.44m
Upsample - 128x8x8x8 -
Conv 3x3x3 LReLU 128x8x8x8 0.44m
Conv 3x3x3 LReLU 128x8x8x8 0.44m
Upsample - 128x8x16x16 -
Conv 3x3x3 LReLU 128x8x16x16 0.44m
Conv 3x3x3 LReLU 128x8x16x16 0.44m
Upsample - 128x8x32x32 -
Conv 3x3x3 LReLU 64x8x32x32 0.22m
Conv 3x3x3 LReLU 64x8x32x32 0.22m
Upsample - 64x16x64x64 -
Conv 3x3x3 LReLU 32x16x64x64 55k
Conv 3x3x3 LReLU 32x16x64x64 27k
Upsample - 32x16x128x128 -
Conv 3x3x3 LReLU 16x16x128x128 13.8k
Conv 3x3x3 LReLU 16x16x128x128 6.9k
Upsample - 16x32x256x256 -
Conv 3x3x3 LReLU 8x32x256x256 3.4k
Conv 3x3x3 LReLU 8x32x256x256 1.7k
Conv 1x1x1 LReLU 3x32x256x256 27
Total Parameters 3.7m
Table A.1: Generator architecture for generation of 256x256x32 videos.
Discriminator Activation Output shape Parameters
Input video - 3x32x256x256 -
Conv 1x1x1 LReLU 128x4x4x4 32
Conv 3x3x3 LReLU 128x4x4x4 1.73k
Conv 3x3x3 LReLU 128x4x4x4 3.47k
Downsample - 128x8x8x8 -
Conv 3x3x3 LReLU 128x8x8x8 6.92k
Conv 3x3x3 LReLU 128x8x8x8 13.85k
Downsample - 128x8x16x16 -
Conv 3x3x3 LReLU 128x8x16x16 27.68k
Conv 3x3x3 LReLU 128x8x16x16 55.36k
Downsample - 128x8x32x32 -
Conv 3x3x3 LReLU 64x8x32x32 0.11m
Conv 3x3x3 LReLU 64x8x32x32 0.22m
Downsample - 64x16x64x64 -
Conv 3x3x3 LReLU 32x16x64x64 0.44k
Conv 3x3x3 LReLU 32x16x64x64 0.44k
Downsample - 32x16x128x128 -
Conv 3x3x3 LReLU 16x16x128x128 0.44m
Conv 3x3x3 LReLU 16x16x128x128 0.44m
Downsample - 16x32x256x256 -
Minibatch Stddev - 129x4x4x4 -
Conv 3x3x3 LReLU 8x32x256x256 .44m
Fully-connected linear 1x1x1x128 1.04m
Fully-connected linear 1x1x1x1 129
Total Parameters 3.7m
Table A.2: Discriminator architecture for generation of 256x256x32 videos.

Appendix B Latent Space Interpolations

Golf Dataset

Figure B.1: Linear interpolation in latent space to generate samples from Golf dataset - 1.
Figure B.2: Linear interpolation in latent space to generate samples from Golf dataset - 2

Aeroplane Dataset

Figure B.3: Linear interpolation in latent space to generate samples from Aeroplane dataset - 1
Figure B.4: Linear interpolation in latent space to generate samples from Aeroplane dataset - 2


Figure B.5: Linear interpolation in latent space to generate samples from TrailerFaces dataset - 1
Figure B.6: Linear interpolation in latent space to generate samples from TrailerFaces dataset - 2
