
StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2

by   Ivan Skorokhodov, et al.

Videos show continuous events, yet most, if not all, video synthesis frameworks treat them discretely in time. In this work, we think of videos as what they should be: time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image and video discriminators pair and propose to use a single hypernetwork-based one. This decreases the training cost and provides a richer learning signal to the generator, making it possible to train directly on 1024^2 videos for the first time. We build our model on top of StyleGAN2, and it is just about 5% more expensive to train at the same resolution while achieving almost the same image quality. Moreover, our latent space features similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at an arbitrarily high frame rate, while prior work struggles to generate even 64 frames at a fixed rate. Our model achieves state-of-the-art results on four modern 256^2 video synthesis benchmarks and one 1024^2 resolution one. Videos and the source code are available at the project website.



1 Introduction

Figure 1: Examples of 1-hour long videos, generated with different methods. MoCoGAN-HD [MoCoGAN-HD] fails to generate long videos due to the instability of the underlying LSTM model when unrolled to large lengths. DIGAN [DIGAN] struggles to generate long videos due to the entanglement of spatial and temporal positional embeddings. StyleGAN-V (our method) generates plausible videos of arbitrary length and frame-rate. Also, unlike DIGAN, it learns temporal patterns not only in terms of motion, but also appearance transformations, like time of day and weather changes.
Figure 2: Our model enjoys all the perks of StyleGAN2 [StyleGAN2], including the ability of semantic manipulation. In this example, we edited a generated frame (top row) or projected off-the-shelf image (bottom row) with CLIP and animated it with our model. To the best of our knowledge, our work is the first one which demonstrates such capabilities for video generators.

Recent advances in deep learning have pushed image generation to unprecedented photo-realistic quality [StyleGAN2-ADA, BigGAN] and spawned many industry applications. Video generation, however, has not enjoyed similar success and struggles to fit complex real-world datasets. The difficulties are caused not only by the more complex nature of the underlying data distribution, but also by the computationally intensive video representations employed by modern generators. They treat videos as discrete sequences of images, which is very demanding for representing long high-resolution videos and induces the use of expensive conv3d-based architectures to model them [TGAN, MoCoGAN, TGANv2, DVD_GAN]. (Footnote 2: e.g., training DVD-GAN [DVD_GAN] is very costly, as reported by [MoCoGAN-HD].)

In this work, we argue that this design choice is not optimal and propose to treat videos in their natural form: as continuous signals $x(t)$ that map any time coordinate $t$ into an image frame. Consequently, we develop a GAN-based continuous video synthesis framework by extending the recent paradigm of neural representations [NeRF, SIREN, FourierFeatures] to the video generation domain.

Developing such a framework comes with three challenges. First, sine/cosine positional embeddings are periodic by design and depend only on the input coordinates. This does not suit video generation, where temporal information should be aperiodic (otherwise, videos will be cycled) and different for different samples. Next, since videos are perceived as infinite continuous signals, one needs to develop an appropriate sampling scheme to use them in a practical framework. Finally, one needs to accordingly redesign the discriminator to operate in the new sampling pipeline.

To solve the first issue, we develop positional embeddings with time-varying wave parameters which depend on motion information, sampled uniquely for different videos. This motion information is represented as a sequence of motion codes produced by a padding-less conv1d-based model. We prefer it over the usual LSTM network [MoCoGAN, MoCoGAN-HD, TGANv2, VideoGLO] to alleviate the RNN’s instability when unrolled to large depths and to produce frames non-autoregressively.

Next, we investigate the question of how many samples are needed to learn a meaningful video generator. We argue that it can be learned from extremely sparse videos (as few as 2 frames per clip), and justify it with a simple theoretical exposition in Sec 3.3 and practical experiments (see Table 2).

Finally, since our model sees only 2-4 randomly sampled frames per video, it is highly redundant to use expensive conv3d-blocks in the discriminator, which are designed to operate on long sequences of equidistant frames. That’s why we replace it with a conv2d-based model, which aggregates information temporally via simple concatenation and is conditioned on the time distances between its input frames. We use hypernetwork-based modulation [Hypernetworks] for this conditioning to make the discriminator more flexible in processing frames sampled at varying time distances. Such a redesign improves training efficiency (see Table 1), provides a more informative gradient signal to the generator (see Fig 4) and simplifies the overall pipeline (see Sec 3.2), since we no longer need two different discriminators to operate on image and video levels separately, as modern video synthesis models do (e.g., [MoCoGAN, TGANv2, DVD_GAN]).

We build our model, named StyleGAN-V, on top of the image-based StyleGAN2 [StyleGAN2]. It is able to produce arbitrarily long videos at an arbitrarily high frame-rate in a non-autoregressive manner and enjoys great training efficiency: it is only about 5% costlier than the classical image-based StyleGAN2 model [StyleGAN2], while having only slightly worse plain image quality in terms of FID [FID] (see Fig 3). This allows us to easily scale it to HQ datasets and we demonstrate that it is directly trainable on 1024^2 resolution.

For empirical evaluation, we use 5 benchmarks: FaceForensics [FaceForensics_dataset], SkyTimelapse [SkyTimelapse_dataset], UCF101 [UCF101_dataset], RainbowJelly (introduced in our work) and MEAD [MEAD_dataset]. Apart from our model, we train from scratch 5 different methods and measure their performance using the same evaluation protocol. Frechet Video Distance (FVD) [FVD] serves as the main metric for video synthesis, but there is no complete official implementation for it (see Sec 4 and Appx C). This leads to discrepancies in the evaluation procedures used by different works because FVD, similarly to FID [FID], is very sensitive to data format and sampling strategy [FID_evaluation]. That’s why we implement, document and release our complete FVD evaluation protocol. In terms of sheer metrics, our method performs on average better than the closest runner-up.

2 Related work

Video synthesis. Early works on video synthesis mainly focused on video prediction [SiftFlow, UnsupervisedVisualPrediction], i.e. generating future frames given a sequence of the previously seen ones. Early approaches for this problem typically employed recurrent convolutional models trained with a reconstruction objective [video_language_modeling, Robot_pushing_dataset, LSTMs_video_representations], but later adversarial losses were introduced to improve the synthesis quality [MultiScaleVideoPrediction, Video2Video, GeneratingVideosWithSceneDynamics]. Some recent works explore autoregressive video prediction with recurrent or attention-based models (e.g., [VideoTransformer, LVT, VideoGPT, PredictingVideoWithVQVAE, VideoPixelNetworks]). Another close line of research is video interpolation, i.e. increasing the frame rate of a given video (e.g., [Video_interpolation_ASC, Video_interpolation_DAIN, Video_interpolation_SuperSloMo]). In our work, we study video generation, which is a more challenging problem than video prediction since it seeks to synthesize videos from scratch, i.e. without using the expressive conditioning on previous frames. Classical methods in this direction are typically based on GANs [GANs]. MoCoGAN [MoCoGAN] and TGAN [TGAN] decompose the generator’s input noise into a content code and motion codes, which became a standard strategy for many subsequent works (e.g., [MoCoGAN-HD, TGANv2, VideoGLO, TemporalShiftGAN]). SVGAN [SelfSupervisedVideoGANs] additionally adds self-supervision losses to improve the synthesis.

MoCoGAN-HD [MoCoGAN-HD] and StyleVideoGAN [StyleVideoGAN], similar to us, consider high-resolution video synthesis. But in their case, the authors perform indirect training by training a motion codes generator in the latent space of a pretrained StyleGAN2 model. StyleGAN-V is trained on extremely sparse videos. This makes it related to [TGANv2, Hierarchical_video_generation, Inmodegan], which use a pyramid of discriminators operating on different, temporally subsampled resolutions. In contrast to the prior work, our generator is continuous in time. In this way it is similar to Vid-ODE [VidODE]: a continuous-time video interpolation and prediction model based on neural ODEs [NODE].

Figure 3: FID scores and training cost by FVD for modern video generators on FaceForensics [FaceForensics_dataset]. Our method shows that video generators can be as efficient and as good in terms of image quality as traditional image-based generators (like StyleGAN2 [StyleGAN2], denoted with the dashed line). FID for video generators was computed from 16-frame-long videos.

To the best of our knowledge, all modern video synthesis approaches utilize expensive conv3d blocks in their decoder and/or encoder components (e.g., [MoCoGAN, MoCoGAN-HD, TGANv2, TemporalShiftGAN, DVD_GAN, LDVDGAN, ProVGAN]). Often, GAN-based approaches utilize two discriminators, operating on image and video levels independently, where the video discriminator operates at a low resolution to save computation (e.g., [MoCoGAN, G3AN, MoCoGAN-HD, DVD_GAN]). In our work, we show that it’s enough to use a single holistic hypernetwork-based [Hypernetworks] discriminator conditioned on the time difference between frames to build a state-of-the-art temporally coherent video generator. For conditioning, we use ModConv2d blocks from [StyleGAN2], which is similar to AnyCostGAN [AnyCostGAN].

Neural Representations. Neural representations is a recent paradigm that uses neural networks to represent continuous signals, such as images, videos, audio, 3D objects and scenes (e.g., [NeRF, SIREN, FourierFeatures, SRNs, TemplateImplicitFunction]). It is mostly popular for 3D reconstruction and geometry processing tasks (e.g., [DeepSDF, DeepMeta, OccupancyNetworks, ConvolutionalOccupancyNetworks, DVR]), including video-based reconstruction [Nerfies, SpaceTimeNeIF, D-nerf, NSFF]. Several recent projects explored the task of building generative models over such representations to synthesize images (e.g., [INR-GAN, CIPS, ALIS]), 3D objects (e.g., [GRAF, piGAN, NeRF-VAE]) or multi-modal signals (e.g., [INRs_distribution, GEM]), and our work extends this line of research to video generation.

Concurrent works. The development of neural representations-based approaches moves extremely fast and there are two concurrent works which propose ideas similar to ours. DIGAN [DIGAN] is a concurrent project that explores the same direction of using neural-based representations for continuous video synthesis and shares a lot of ideas with our work. The authors also consider a continuous-time generator, trained by a discriminator without conv3d layers. The core difference with our work is that they use a different parametrization of motions and use a dual discriminator: one operates at the video level and the second one on individual images. We enumerate the differences and similarities in Appx H. NeRV [NeRV] is another concurrent project which proposes to represent videos as convolutional neural representations. But in their case, the authors explore compression and denoising tasks. GEM [GEM] utilizes generative latent optimization [GLO] to build a multi-modal generative model.

3 Model

Figure 4: Visualizing the gradient signal to the generator during training from the conv3d-based discriminator of MoCoGAN-HD (upper row) and our hypernetwork-based discriminator (lower row) at different timesteps.

Our model is based on the paradigm of neural representations [NeRF, SIREN, FourierFeatures], i.e. representing signals as neural networks. We treat each video as a function $x(t)$ which is continuous in time $t$. In this manner, the training dataset is a set of subsampled signals $\{(x^{(i)}_{t_0}, \dots, x^{(i)}_{t_{\ell_i - 1}})\}_{i=1}^{N}$, where $N$ denotes the total number of videos, $t_j$ denotes the time position of the $j$-th frame and $\ell_i$ is the amount of frames in the $i$-th video. (Footnote 3: To simplify the notation, we assume that all videos have the same frame-rate and that all the videos were sampled starting at $t = 0$, but it is not a limitation of the method.) Note that each video might have a different length and in practice these lengths vary a lot (see Appx E for datasets statistics). Our goal is to train a generative model over video signals, having only their subsampled versions. To achieve this, we develop the following framework.

We build the model on top of StyleGAN2 [StyleGAN2-ADA] and redesign its generator and discriminator networks for video synthesis. Our generator is conceptually similar to MoCoGAN [MoCoGAN], i.e., we separate latent information into a content code and a motion trajectory. In contrast to MoCoGAN, our motion codes are continuous in time and we describe their design in Sec 3.1. The only modification we make on top of StyleGAN2’s generator is the concatenation of our continuous motion codes to its constant input tensor. In all other aspects, it is entirely equivalent to its image-based counterpart.

The discriminator takes the frames of a sparsely sampled video, independently extracts features from them, concatenates those features together channel-wise into a global video descriptor and predicts the real/fake class from it. It is conditioned on the time distances between frames, and we use hypernetwork-based conditioning to input this information.

Figure 5: Our generator architecture: the only change we make on top of the StyleGAN2 generator’s synthesis network is the concatenation of our motion codes to the constant input tensor. The synthesis network produces frames independently from each other using the content code and the motion code. Since the motion codes are also computed non-autoregressively, our video generator is non-autoregressive as well. Synthesis blocks are utilized without any changes from StyleGAN2 [StyleGAN2]. For 256^2-resolution datasets, we remove the last two blocks from the synthesis network.

3.1 Generator structure

Overview. The generator $G$ consists of three components: the content mapping network $F_c$, the motion mapping network $F_m$ and the synthesis network $S$. $F_c$ and $S$ are equivalent to their StyleGAN2 counterparts with the exception that we tile and concatenate motion codes $v_t$ to the constant input tensor of $S$.

A video is generated the following way. First, we sample the content noise $z^c$ and, following StyleGAN2, transform it into the latent code $w = F_c(z^c)$. It is shared for all timesteps of a video. Then, to generate a frame $x_t$ in the specified time location $t$, we first compute its motion code $v_t$, which is done in three steps. First, we sample a discrete sequence of equidistant trajectory noise vectors $z^m_{t_0}, \dots, z^m_{t_n}$ (we assume $t_0 = 0$ everywhere), positioned at distance $\delta^z$ from one another. The number of tokens $n$ is determined by the condition $n \delta^z \ge t$, i.e. the sequence should be long enough to cover the desired timestep $t$. (Footnote 4: In practice, since $F_m$ uses padding-less convolutions, this sequence is slightly larger. We elaborate on this in Appx B.) Then, we process it with the conv1d-based motion mapping network $F_m$ with a large kernel size into the sequence $u_{t_0}, \dots, u_{t_n}$. After that, we take the pair of tokens $u_\ell, u_r$ which $t$ lies between (i.e. $\ell = t_i$, $r = t_{i+1}$ for some $i$ and $\ell \le t < r$) and compute an acyclic positional embedding $v_t$ from them, described next. This positional embedding serves as the motion code for our generator. In fact, we do not need to sample all the motion noise vectors to produce $v_t$, but only those which $v_t$ depends on. In this way, our generator can produce frames non-autoregressively.
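The three steps above can be sketched in a few lines of NumPy. This is only an illustration under assumed names and sizes: `delta_z`, `kernel_size`, `dim` and the random convolution weights are hypothetical placeholders, and a single padding-less convolution stands in for the motion mapping network. The sketch samples equidistant noise tokens, applies the padding-less 1D convolution, and returns the pair of output positions that surround the requested time `t`.

```python
import numpy as np

def sample_motion_tokens(t, delta_z=16.0, kernel_size=11, dim=8, seed=0):
    """Return (u, i, left_time, right_time) for a time location t >= 0.

    All default values are illustrative assumptions, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    # Number of output tokens needed so that t falls into [t_i, t_{i+1}).
    n_out = int(np.floor(t / delta_z)) + 2
    # A padding-less ("valid") convolution shortens the sequence by
    # kernel_size - 1, so that many extra noise tokens are sampled.
    n_in = n_out + kernel_size - 1
    z = rng.standard_normal((n_in, dim))
    # Hypothetical stand-in for the motion mapping network: one valid conv1d.
    w = rng.standard_normal((kernel_size, dim, dim)) / np.sqrt(kernel_size * dim)
    u = np.stack([
        sum(z[j + k] @ w[k] for k in range(kernel_size))
        for j in range(n_out)
    ])
    i = int(np.floor(t / delta_z))  # index of the left token
    return u, i, i * delta_z, (i + 1) * delta_z
```

For simplicity the sketch samples the whole prefix of the noise sequence; since each output token depends only on `kernel_size` neighbouring inputs, it would be enough to sample just the noise inside the receptive field of the two tokens around `t`.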

Acyclic positional encoding. Traditional positional embeddings [SIREN, FourierFeatures] are cyclic by default. This does not create problems in traditional applications (like image or scene representations) because the spatial domain used there never exceeds the period length [NeRF, INR-GAN]. But for video generation, cyclicity is not desirable, because it makes a video loop at some point. To solve this issue, we develop the acyclic positional encoding mechanism.

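A sine-based positional embedding vector can be expressed in the following form:

$$ p_t = \alpha \odot \sin(\omega \cdot t + \rho), \tag{1} $$

where $\odot$ denotes element-wise vector multiplication, $\alpha, \omega, \rho$ are amplitudes, periods and phases of the corresponding waves, and the sine function is applied element-wise. By default, these embeddings are periodic and always the same for any input [SIREN, FourierFeatures, NeRF], which is not desirable for video synthesis, where natural videos contain different motions and are typically aperiodic. To solve this issue, we compute the wave parameters from motion noise the following way. First, “raw” motion codes $\tilde{v}_t$ are computed using wave parameters predicted from the left token $u_\ell$ of the pair of tokens that $t$ lies between:

$$ \tilde{v}_t = \alpha_\ell \odot \sin(\omega_\ell \cdot t + \rho_\ell), \tag{2} $$

$$ \alpha_\ell = W_\alpha u_\ell, \quad \omega_\ell = W_\omega u_\ell, \quad \rho_\ell = W_\rho u_\ell, \tag{3} $$

where $W_\alpha$, $W_\omega$ and $W_\rho$ are learnable weight matrices. Using $\tilde{v}_t$ directly as motion codes does not lead to good results since it contains discontinuities (see Fig 8(d)). That’s why we “stitch” their start and end values via:

$$ v_t = \tilde{v}_t - \mathrm{lerp}(\tilde{v}_\ell, \tilde{v}_r, t) + \mathrm{lerp}(W_a u_\ell, W_a u_r, t), \tag{4} $$

where $W_a$ is a learnable weight matrix and $\mathrm{lerp}(x, y, t)$ is the element-wise linear interpolation between $x$ and $y$ using the time position $t$. The first subtraction in Eq (4) alters the positional embeddings to make them converge to zero values at locations $\ell$ and $r$. This limits the expressive power of the positional embeddings and that’s why we add the “alignment” vectors $W_a u_\ell$ and $W_a u_r$ to restore it. See Fig 8(e) for the visualization.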
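The stitching step can be illustrated with a small NumPy sketch. It follows the description above (wave parameters predicted linearly from the left token, subtraction of the interpolated end values, addition of interpolated alignment vectors), but the function name, argument names and matrix shapes are assumptions made for illustration, and the plain linear prediction of the periods ignores the practical rescaling mentioned in the text.

```python
import numpy as np

def lerp(a, b, s):
    """Element-wise linear interpolation between a and b with weight s in [0, 1]."""
    return (1.0 - s) * a + s * b

def acyclic_embedding(t, left, right, u_left, u_right, W_amp, W_per, W_pha, W_align):
    """Motion code at time t for the segment [left, right] (illustrative sketch)."""
    # Wave parameters are predicted from the left token of the segment.
    amp, per, pha = W_amp @ u_left, W_per @ u_left, W_pha @ u_left

    def raw(time):
        # "Raw" sine-based motion code with segment-specific wave parameters.
        return amp * np.sin(per * time + pha)

    s = (t - left) / (right - left)  # relative position inside the segment
    # Subtracting the interpolated end values makes the wave vanish at both ends.
    stitched = raw(t) - lerp(raw(left), raw(right), s)
    # The interpolated alignment vectors give the code a non-zero value there again.
    return stitched + lerp(W_align @ u_left, W_align @ u_right, s)
```

With this construction the code at a segment boundary depends only on the alignment vector of the shared token, so two neighbouring segments agree at the point where they meet, which is what removes the discontinuities of the raw codes.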

In practice, we found it useful to compute the periods with a rescaled formula that involves a vector of ones and linearly-spaced scaling coefficients. See Appx B and the source code for details.

One could try using continuous codes directly as motion codes instead of the acyclic positional embeddings. This also eliminates cyclicity (in theory), but leads to poor results in practice: if the distance between the tokens is small, then the motion trajectory will contain unnatural sharp transitions; and when this distance is increased, the model loses its ability to properly model high-frequency motions (like blinking) since the codes change too slowly. We empirically validate this in Table 2 (also see samples on the project webpage).

3.2 Discriminator structure

Modern video generators typically utilize two separate discriminators which operate on image and video levels separately [DVD_GAN, MoCoGAN, MoCoGAN-HD]. But since we train on extremely sparse videos and aim to have a computationally efficient model, we propose to use a holistic hypernetwork-based discriminator $D$, which is conditioned on the time distances between frames. It consists of two parts: 1) the feature extractor backbone $D_b$, which independently embeds each image frame into a 3D feature tensor; and 2) the convolutional head $D_h$, which takes the concatenation of all the features and outputs the real/fake logit.

We input the information about the time distances between frames into the discriminator the following way. First, we encode them with positional encoding [SIREN, FourierFeatures], preprocess with a 2-layer MLP and concatenate into a single vector $p_\delta$. After that, we use $p_\delta$ to modulate the weights of the first convolutional layer in each block of the backbone and the head, and also as a conditioning vector in the projection head [ProjectionDiscriminator] in StyleGAN2’s DiscrEpilogue block. The modulation in the discriminator is equivalent to the modulation in the generator, but uses $p_\delta$ instead of the latent vector $w$: it is passed through a mapping network, transformed with an affine layer and multiplied with the 4D weight tensor of a convolutional layer across the input channel axis. We do not modulate every convolutional layer for practical reasons: ModConv2d is heavier than Conv2d. The overall architecture is visualized in Fig 6.
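The weight modulation described above can be sketched as follows. This is a minimal NumPy illustration of StyleGAN2-style modulation and demodulation applied to a convolution weight tensor; the function name, the `affine_w`/`affine_b` parameters and the shapes are assumptions for the example, and the mapping network that precedes the affine layer is omitted.

```python
import numpy as np

def modulate_weights(weight, cond, affine_w, affine_b, demodulate=True, eps=1e-8):
    """Scale a conv weight tensor by a conditioning vector (illustrative sketch).

    weight:   (out_ch, in_ch, kh, kw) convolution weights
    cond:     (cond_dim,) conditioning vector, e.g. the encoded time distances
    affine_w: (in_ch, cond_dim) and affine_b: (in_ch,) form the affine layer
    """
    # The affine layer maps the conditioning vector to one scale per input channel.
    scales = affine_w @ cond + affine_b
    # Multiply the 4D weight tensor along its input-channel axis.
    w = weight * scales[None, :, None, None]
    if demodulate:
        # Rescale every output filter to (approximately) unit L2 norm.
        norm = np.sqrt((w ** 2).sum(axis=(1, 2, 3), keepdims=True) + eps)
        w = w / norm
    return w
```

The modulated tensor is then used as the weight of an ordinary 2D convolution, so the same layer behaves differently for different time distances between the input frames.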

Figure 6: Discriminator architecture for a fixed number of frames per video. Traditional StyleGAN2’s DiscrBlock [StyleGAN2] is replaced with ModDiscrBlock to use ModConv2d. DiscrEpilogue is equivalent to StyleGAN2’s implementation: we refer an interested reader to [StyleGAN2] for details on it. ModConv2d is the Conv2d layer with weight demodulation from StyleGAN2’s generator. The discriminator is conditioned on the time distances between frames in two ways: via ModConv2d and via the projection [ProjectionDiscriminator] in DiscrEpilogue (the default conditioning scheme of StyleGAN2).

Such an approach is much more efficient than using dual discriminators, since we no longer need a conv3d-based discriminator, which is too expensive to operate on high-resolution videos [DVD_GAN]. Moreover, as we demonstrate in Fig 4, it provides a more informative learning signal to the generator.

Figure 7: Random samples from the existing methods on FaceForensics, SkyTimelapse and RainbowJelly, respectively. We sample a 64-frame video and display every 4th frame.

3.3 Implicit assumptions of sparse training

Videos are continuous signals and any practical video generator relies on some sort of subsampling. The question of how many samples are necessary to train a video synthesis model is fundamental, because this design decision greatly influences the quality and training cost. In our work, we empirically show that one can train a state-of-the-art video generator with as few as 2 frames per video.

Consider the problem of learning a probability distribution $p(x) = p(x_1, \dots, x_n)$ and consider that we utilize sparse training, i.e. select $k$ coordinates of the vector $x$ randomly on each iteration of the optimization process. Then our optimization objective is equivalent to learning all possible marginal distributions $p(x_{i_1}, \dots, x_{i_k})$ instead of learning the joint distribution $p(x)$. When does learning the marginals allow us to obtain the full joint distribution at the end? The following simple statement adds some clarity to this question.

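A trivial but serviceable statement. Let’s denote by $\mathcal{N}_i$ a collection of sets of up to $k-1$ indices s.t. we have $j < i$ for every index $j$ in every set $n \in \mathcal{N}_i$. In other words, each $n \in \mathcal{N}_i$ is a set of up to $k-1$ indices preceding $i$. Then, $p(x)$ can be represented as a product of marginals over up to $k$ coordinates if and only if for every $i$ there exists $n_i \in \mathcal{N}_i$ s.t. $p(x_i \mid x_{<i}) = p(x_i \mid x_{n_i})$.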

The above statement is primitive (see the proof in Appx F) but can provide useful practical intuition. For video synthesis, it implies that one can learn a video generator by using only $k$ frames per video only if for any frame there exist at most $k-1$ previous frames sufficient to properly predict it (see Appx F). And we argue that for the modern video synthesis datasets, one does not need a lot of frames to make such a prediction. For example, for SkyTimelapse [SkyTimelapse_dataset], the motions are typically unidirectional and thus easily predictable from only 2 previous frames (which corresponds to training with $k = 3$ frames per video). But surprisingly, in practice we found that using even $k = 2$ frames per clip can provide state-of-the-art performance.
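As an illustrative special case (a sketch added for intuition, assuming a first-order Markov structure that is not claimed for any particular dataset), consider three frames where each frame depends only on the previous one. Then the joint distribution is fully determined by the two-frame marginals:

```latex
p(x_1, x_2, x_3)
  = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)
  = \frac{p(x_1, x_2)\, p(x_2, x_3)}{p(x_2)},
\qquad
p(x_2) = \int p(x_1, x_2)\, dx_1 .
```

Here a single previous frame is sufficient to predict the next one, so training on pairs of frames already carries all the information about the joint distribution.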

We treat videos as infinite continuous signals, but in practice one has to set a limit on the maximum time location which can be seen during training. To the best of our knowledge, previous methods use fairly small limits [TGANv2, Hierarchical_video_generation], but in our case we train the model with a much larger one, which does not lead to much additional computational burden due to the non-autoregressive nature of our generator. However, we set the maximum distance between the first and the last sampled frames to 32 to cover short-term and medium-term movements: otherwise, we observed unstable training and abrupt motions in video samples. To sample frames, we first sample this distance and then the location of the first frame. After that, the locations of the remaining frames are selected at random without repetitions.
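The sampling procedure can be sketched as below. The function name and arguments are hypothetical, and the uniform distributions used for the distance and the start position are an assumption, since the text only fixes the order of the steps and the maximum distance of 32.

```python
import random

def sample_frame_indices(video_len, k=3, max_dist=32, rng=random):
    """Pick k distinct, sorted frame indices whose span is at most max_dist."""
    if not 2 <= k <= video_len or k - 1 > max_dist:
        raise ValueError("need 2 <= k <= video_len and k - 1 <= max_dist")
    # 1) Distance between the first and the last sampled frame
    #    (assumed uniform; at least k - 1 so that k distinct frames fit).
    dist = rng.randint(k - 1, min(max_dist, video_len - 1))
    # 2) Location of the first frame, so that the last one stays inside the video.
    start = rng.randint(0, video_len - 1 - dist)
    # 3) Remaining frames strictly in between, drawn without repetitions.
    middle = rng.sample(range(start + 1, start + dist), k - 2)
    return sorted([start, *middle, start + dist])
```

A call such as `sample_frame_indices(100, k=3, rng=random.Random(0))` returns three increasing frame indices that are at most 32 frames apart.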

| Method | FaceForensics FVD_16 | FaceForensics FVD_128 | SkyTimelapse FVD_16 | SkyTimelapse FVD_128 | UCF101 FVD_16 | UCF101 FVD_128 | RainbowJelly FVD_16 | RainbowJelly FVD_128 | Training cost (GPU-days) |
|---|---|---|---|---|---|---|---|---|---|
| MoCoGAN [MoCoGAN] | 124.7 | 257.3 | 206.6 | 575.9 | 2886.9 | 3679.0 | 1572.9 | 549.7 | 5 |
| + StyleGAN2 backbone | 55.62 | 309.3 | 85.88 | 272.8 | 1821.4 | 2311.3 | 638.5 | 463.0 | 8 |
| MoCoGAN-HD [MoCoGAN-HD] | 111.8 | 653.0 | 164.1 | 878.1 | 1729.6 | 2606.5 | 579.1 | 628.2 | 7.5 + 9 |
| VideoGPT [VideoGPT] | 185.9 | N/A | 222.7 | N/A | 2880.6 | N/A | 136.0 | N/A | 16 + 16 |
| DIGAN [DIGAN] (concurrent work) | 62.5 | 1824.7 | 83.11 | 196.7 | 1630.2 | 2293.7 | 436.6 | 369.0 | 16 |
| StyleGAN-V (ours) | 47.41 | 89.34 | 79.52 | 197.0 | 1431.0 | 1773.4 | 195.4 | 262.5 | 8 |

Table 1: Quantitative performance and training cost of different methods. We trained all the methods from scratch on 256^2 resolution datasets using the official codebases and evaluated them under the unified evaluation protocol (see Sec 4). Training was done on 32 GB NVidia V100 GPUs for all the methods except VideoGPT, which was trained on NVidia A6000 GPUs (with 48.5 GB of memory each) due to its high memory consumption. For 2-stage methods, we report their training cost in the “first stage + second stage” format. VideoGPT was trained for our maximum resource constraint of 32 GPU-days, which was detrimental to its performance on 256^2 resolution. Vanilla StyleGAN2 training time on 256^2 resolution (with mixed precision and optimizations [StyleGAN2-ADA]) is 7.72 GPU-days in our environment.

4 Experiments

Datasets. We test our model on 5 benchmarks: FaceForensics [FaceForensics_dataset], SkyTimelapse [SkyTimelapse_dataset], UCF101 [UCF101_dataset], RainbowJelly (introduced by us) and MEAD [MEAD_dataset]. We used the train splits (when they are available) for all the datasets except for UCF101: it is an extremely difficult dataset, so we used the train+test splits for it. We noticed that modern video synthesis datasets are either too simple or too difficult in terms of content and motion, and there are no datasets “in-between”. That’s why we introduce RainbowJelly: a dataset of “floating” jellyfish from the Hoccori Japan YouTube channel, which contains 8 hours of videos in total. It contains simple content but complex hierarchical motions and this makes it a challenging but approachable test-bed for evaluating modern video generators. We provide its details in Appx E. RainbowJelly and MEAD have 30 FPS, unlike the other datasets.

Evaluation. Following prior work, we use Frechet Video Distance (FVD) [FVD] and Inception Score (IS) as our evaluation metrics with FVD being the main one since FID (its image-based counterpart) better aligns with human-perceived quality [FID]. We use two versions of FVD: FVD_16 and FVD_128, which use 16-frame-long and 128-frame-long videos respectively to compute their statistics. Inception Score is used only to evaluate the generation quality on UCF-101 since it uses a UCF-101-finetuned C3D model [TGAN].

The official FVD project [FVD] does not provide the complete implementation of the evaluation pipeline, but rather the inference script for a single batch of videos, which are required to be already resized to a fixed resolution and loaded into memory. This creates discrepancies in the evaluation protocols used by previous works since FVD (similar to FID [FID_evaluation]) is very sensitive to the interpolation procedures and perceptually unnoticeable artifacts introduced by data processing procedures, like JPEG compression. We also found it to be very sensitive to how one extracts clips from real videos to compute the statistics. We implement and release a complete FVD evaluation protocol and use it to evaluate all the methods for fair comparison. It is documented in Appx C.
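For reference, FVD, like FID, compares real and generated data via the Fréchet distance between two Gaussians fitted to feature vectors. The sketch below shows only this final step on already extracted features; the feature extractor, the clip extraction and the resizing, which are the protocol-sensitive parts discussed above, are deliberately left out. The function name and the use of plain NumPy are illustrative choices, not the released evaluation code.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Because the statistics are estimated from the extracted features, any change in how clips are cut, resized or compressed shifts the means and covariances and therefore the final score, which is why a single documented protocol matters.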

Figure 8: Generations on MEAD [MEAD_dataset] for MoCoGAN-HD [MoCoGAN-HD] and our method. MoCoGAN-HD cannot preserve the identity and diverges for long LSTM unrolling. (Note that all videos in the dataset have static head positions — see Appx E).

Baselines. We use 5 baselines for comparison: MoCoGAN [MoCoGAN], MoCoGAN [MoCoGAN] with the StyleGAN2 [StyleGAN2] backbone, VideoGPT [VideoGPT], MoCoGAN-HD [MoCoGAN-HD] and DIGAN [DIGAN]. For MoCoGAN with the StyleGAN2 backbone (denoted as MoCoGAN+SG2), we replaced its generator and image-based discriminator with the corresponding StyleGAN2 components, leaving its video discriminator unchanged. We also used the training scheme and regularizations from StyleGAN2. MoCoGAN was trained for 5 days on a single GPU since its lightweight DC-GAN [DC_GAN] backbone makes it fast to train, while MoCoGAN+SG2 was trained for 2 days on multiple GPUs to reach 25M real images seen by its image-based discriminator. MoCoGAN-HD is trained for 4.5 days on V100 GPUs, as specified in the original paper (Appx B of [MoCoGAN-HD]). We trained VideoGPT for the maximum affordable total time of 32 GPU-days in our resource constraints. We trained DIGAN [DIGAN] for 5 days since we observed that by that time the metrics either plateaued or exploded (for RainbowJelly). We also noticed that DIGAN uses weighted sampling during training by selecting clips from long videos with higher probabilities. This hurts its FVD score and that’s why we altered its data sampling strategy to the uniform one, which is used by other methods [MoCoGAN, MoCoGAN-HD, VideoGPT]. For each method we selected the checkpoint with the best FVD value.

4.1 Main experiments

For the main evaluation, we train our method and all the baselines from scratch on the described datasets. Each model is trained on NVidia V100 32 GB GPUs, except for VideoGPT, which is very demanding in terms of GPU memory at 256^2 resolution, so we had to train it on NVidia A6000 GPUs instead (with an overall batch size of 4). For our method and MoCoGAN+SG2, we use exactly the same optimization scheme as StyleGAN2, including the loss function, Adam optimizer hyperparameters and R1 regularization [R1_reg]. We reduce the learning rate by a factor of 10 for the video discriminator of MoCoGAN+SG2 since it does not have an equalized learning rate [ProGAN]. SkyTimelapse uses a different setting of one hyperparameter than all the other experiments. See other training details in Appx B. We evaluate all the methods under the same evaluation protocol, described in Appx C, and report the results in Table 1.

To measure the efficiency, we use the number of GPU-days required to train a method. We build on top of the official StyleGAN2 implementation. (Footnote 5: The training cost of the image-based StyleGAN2 to reach its specified 25M images is 7.72 NVidia V100 GPU-days in our environment.) StyleGAN-V is trained for 2 days, which corresponds to 23M real frames seen by the discriminator. MoCoGAN-HD is built on top of a different StyleGAN2 codebase, which is slower than the highly optimized NVidia implementation. That’s why in Table 1 we report its training cost reduced by a factor of 2 to account for this.

Method FaceForensics SkyTimelapse
Default StyleGAN-V 47.41 89.34 79.52 197.0
w/o our 65.88 41.77 109.1 240.2
w/o our 154.0 139.1 236.9 258.0
w/o hyper-modulation in 88.8 161.5 69.8 184.1
w/o projection cond in 95.4 236.0 102.1 210.3
w LSTM codes, 131.9 159.1 135.7 196.1
w LSTM codes, 180.3 94.55 95.71 165.8
Table 2: Ablating architectural components of our model.
Number of frames FaceForensics SkyTimelapse
60.41 93.5 50.5 209.9
(default) 47.41 89.34 79.52 197.0
51.84 114.9 65.7 194.5
101.9 211.4 73.12 215.9
92.52 192.8 107.6 254.3
Table 3: Ablating the amount of frames per clip used during training. Sparse training provides better results for our method.

Our method significantly outperforms the existing ones on almost all the benchmarks in terms of FVD and FVD. On UCF101, all the methods perform poorly, which shows that it is too difficult a benchmark for modern video generators and that one needs to train models of extreme scale to fit it [DVD_GAN]. For completeness, the Inception Score [TGAN] results are provided in Table 5 in Appx C.

We visualize the samples in Fig 1 and Fig 7. Our method is able to generate plausible-looking hour-long videos, though their motion diversity and global motion coherence would be limited (see Appx A). MoCoGAN-HD suffers from LSTM instability when unrolled to large lengths and does not produce diverse motions. DIGAN produces high-quality videos on SkyTimelapse because its inductive bias of having joint spatio-temporal positional information is well suited for videos that have an entire scene moving. But for FaceForensics, this leads to a “head flying away” effect (see Appx H). To generate 1-hour long videos from MoCoGAN-HD, we unroll its LSTM model to the required depth ( steps) and synthesize frames only in the necessary time positions, while DIGAN, similar to our method, is able to generate frames non-autoregressively.

4.2 Ablations

To ablate the core components, we replaced or modules with their MoCoGAN+SG2 counterparts. In both cases, their removal leads to poor short-term and long-term video quality, as shown by the corresponding metrics in Table 2 and the video samples in the supplementary.

Replacing continuous motion codes with , produced by the LSTM model hurts the performance, especially when the distance between motion codes is small. This happens due to unnaturally abrupt transitions between frames and we provide the corresponding samples in the supplementary. The corresponding results are in Table 2.

Another architectural decision which we verify is the conditioning scheme utilized in . We use both ModConv2d blocks and the projection conditioning head [ProjectionDiscriminator] in DiscrEpilogue in our discriminator to provide the conditioning signal about the time differences between frames . Removing any of them hurts the performance, because it constrains the ability of to understand the temporal scale it is currently operating on. We ablated the hypernetwork-based modulation by feeding a vector of zeros instead of positional embeddings of time differences into ModConv2d so that only the bias vectors in the corresponding affine transforms participate in the modulation process.
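The zero-input ablation can be illustrated with a toy affine layer. This is only a sketch: the layer sizes, weights and the `affine` helper are hypothetical and are not taken from the released code. It shows that when the input embedding is all zeros, the output of an affine transform reduces to its bias, so the modulation no longer depends on the time differences.

```python
def affine(weight, bias, x):
    """Compute y = W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


# Hypothetical toy sizes: a 3-dimensional time embedding mapped to 2 modulation styles.
W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
b = [1.0, 1.0]

time_embedding = [0.3, 0.7, -0.2]         # stands in for a positional embedding
styles_default = affine(W, b, time_embedding)
styles_ablated = affine(W, b, [0.0] * 3)  # zeros instead of the embedding

print(styles_default)  # depends on the time embedding
print(styles_ablated)  # [1.0, 1.0]: only the bias survives
```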

An important design choice is how many samples per video one should use during training. We try different values of for and and report the corresponding results in Table 3. As discussed in Sec 3.3, for existing video generation benchmarks, it might be enough to sample only several frames per video, and our experiments confirm this observation. The performance decreases for larger , but this might be attributed to the weak temporal aggregation procedure of , which simply concatenates features together. It is surprising to see that modern datasets can be fit with as few as 2 samples per video.

4.3 Properties

Our generator is able to generate arbitrarily long videos. Our design of motion codes allows StyleGAN-V not to suffer from stability problems when unrolled to large (potentially infinite) video lengths. This is verified by visualizing the video clips for extremely large timesteps in Fig 1 and Fig 8. We also demonstrate its ability to produce videos at an arbitrarily high frame rate in the supplementary.

Our model has the same latent space manipulation properties as StyleGAN2. To show this, we conduct two experiments: embedding, editing and animating an off-the-shelf image and editing and animating the first frame of a generated video. To embed an image, we used the optimization procedure similar to [Image2StyleGAN], but considering it to be positioned at . To edit an image with CLIP, we used the procedure of [StyleCLIP]. The results of these experiments are visualized in Fig 2 and we provide the details in Appx B and more examples in the supplementary. Apart from showing the good properties of its latent space, these experiments demonstrate the extrapolation potential of our generator.

StyleGAN-V has almost the same training efficiency and image quality as StyleGAN2. In Fig 3, we plot the FID scores and training costs of modern video generators on FaceForensics against their corresponding FVD scores. Our method comes very close to StyleGAN2: it converges to FID of 9.44 in 8 GPU-days compared to FID of 8.42 in 7.72 GPU-days for StyleGAN2, which is only about 12% worse in FID. This raises a question of whether video generators can be as computationally efficient and good in terms of image quality as image-based ones.

Our model is the first one which is directly trainable on 1024^2 resolution. We provide the generations on MEAD for our method and for MoCoGAN-HD. MoCoGAN-HD cannot preserve the identity of a speaker and diverges for large video lengths, while our method achieves comparable image quality and coherent motions. For this dataset, our model was trained for 7 days on NVidia V100 GPUs and obtained FID of 24.12 and FVD of 156.1. The image generator for MoCoGAN-HD was trained for days on A6000 GPUs, while its video generator was trained for only days since it didn’t require high-resolution training.

Our discriminator provides more informative learning signal to . Fig 4 visualizes the gradient signal to the generator from our discriminator and the conv3d-based video discriminator of MoCoGAN-HD, measured at of training for our method (at 10M images seen by ) and MoCoGAN-HD (at the 300-th epoch). In our case, one can easily see fine-grained details of the face structure, perceived by , while in the case of MoCoGAN-HD, most of the gradient is redundant and lacks any structural information.

Content and motion decomposition. Similar to MoCoGAN [MoCoGAN], our generator captures content and motion variations in a disentangled manner: altering motion codes while fixing does not change the appearance variations (like, a speaker’s identity). Similarly, re-sampling does not influence motion patterns on a video, but only its content. We provide the corresponding visualizations on the project website.

5 Conclusion

In this work, we provided a different perspective on time for video synthesis and built a continuous video generator using the paradigm of neural representations. For this, we developed motion representations through the lens of positional embeddings, explored sparse training of video generators and redesigned a typical dual structure of a video discriminator. Our model is built on top of StyleGAN2 and features a lot of its perks, like efficient training, good image quality and editable latent space.

We hope that our work would serve as a solid basis for building more powerful video generators in the future. The limitations and potential negative impact are discussed in Appx A.


Appendix A Limitations and potential negative impact

A.1 Limitations

Our model has the following limitations:

  • Limitations of sparse training. In general, sparse training makes it impossible for to capture complex dependencies between frames. But surprisingly, it provides state-of-the-art results on modern datasets, which (using the statement from Sec 3.3) implies that they are not that sophisticated in terms of motion.

  • Dataset-induced limitations. Similar to other machine learning models, our method is bound by the quality of the dataset it is trained on. For example, for the FaceForensics dataset [FaceForensics_dataset], our embedding and manipulation results are inferior to the StyleGAN2 ones [Image2StyleGAN]. This is due to the limited number of identities (just ~700) in FaceForensics and their larger diversity in terms of quality compared to FFHQ [StyleGAN], which StyleGAN2 was trained on.

  • Periodicity artifacts. still produces periodic motions sometimes, despite our acyclic positional embeddings. Further investigation of this phenomenon is needed.

  • Poor handling of new content appearing. We noticed that our generator tries to reuse the content information encoded in the global latent code as much as possible. It is noticeable on datasets where new content appears during a video, like Sky Timelapse or Rainbow Jelly. We believe it can be resolved using ideas similar to ALIS [ALIS].

  • Sensitivity to hyperparameters. We found our generator to be sensitive to the minimal initial period length (see Appx B). We increased it for SkyTimelapse [SkyTimelapse_dataset] from 16 to 256: otherwise the generated videos contained unnaturally sharp transitions.

We plan to address those limitations in our future works.

A.2 Potential negative impact

The potential negative impact of our method is similar to that of traditional image-based GANs: creating “deepfakes” and using them for malicious purposes. Our model made it much easier to train a model which produces much more realistic video samples with a small amount of computational resources. But since the availability of high-quality datasets is very low for video synthesis, the resulting model will fall short compared to its image-based counterpart, which could use rich, extremely high-quality image datasets for training, like FFHQ [StyleGAN].

Appendix B Implementation and training details

Note, that all the details can be found in the source code:

B.1 Optimization details and hyperparameters

Our model is built on top of the official StyleGAN2-ADA [StyleGAN2-ADA] repository. In this work, we build a model to generate continuous videos, and a reasonable question to ask is why not use INR-GAN [INR-GAN] instead (like DIGAN [DIGAN]) to have fully continuous signals? The reason why we chose StyleGAN2 instead of INR-GAN is that StyleGAN2 is amenable to mixed-precision training, which makes it train times faster. For INR-GAN, enabling mixed precision severely decreases the quality, and we hypothesize that the reason for this is that each pixel in INR-GAN’s activations tensor carries more information (due to the spatial independence) since the model cannot spatially distribute information anymore. Explicitly restricting the range of possible values then adds a strict upper bound on the amount of information each pixel is able to carry. We also found that adding coordinates information does not improve video quality for our generator, either qualitatively or in terms of scores.

Similar to StyleGAN2, we utilize non-saturating loss and R1 regularization with the loss coefficient of 0.2 in all the experiments, which is inherited from the original repo; we did not run any hyperparameter search for it. We also use the fmaps parameter of 0.5 (the original StyleGAN2 used fmaps parameter of 1.0), which controls the channel dimensionalities in and , since it is the default setting for StyleGAN2-ADA for resolution. This allowed us to further speed up training.

The dimensionalities of are all set to 512.

As stated in the main text, we use a padding-less conv1d-based motion mapping network with a large kernel size to generate raw motion codes . In all the experiments, we use the kernel size of 11 and stride of . We do not use any dilation in it despite the fact that it could increase the temporal receptive field: we found that varying the kernel size didn’t produce much benefit in terms of video quality. Using padding-less convolutions allows the model to be stable when unrolled at large depths. We use 2 layers of such convolutions with a hidden size of 512. Another benefit of using conv1d-based blocks is that, in contrast to LSTM/GRU cells, one can practically incorporate the equalized learning rate [ProGAN] scheme into them.

Using a conv1d-based motion mapping network without padding forces us to use “previous” motion noise codes . That’s why instead of sampling a sequence , we sample a slightly larger one to adjust for the reduced sequence size. For the same-padding strategy, for sampling a frame at position , we would need to produce motion noise codes . But with our kernel size of 11, with 2 layers of convolutions and without padding, the resulting sequence size is .
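The bookkeeping behind the “slightly larger” noise sequence can be sketched as follows. The kernel size of 11 and the 2 layers come from the text; the stride of 1, the helper names and the target length of 16 are assumptions made for illustration only. Each padding-less, stride-1 convolution shortens the sequence by `kernel_size - 1`, so the input must be longer by that amount per layer.

```python
def valid_conv_output_length(input_length: int, kernel_size: int, stride: int = 1) -> int:
    """Output length of a 1D convolution without padding or dilation."""
    if input_length < kernel_size:
        raise ValueError("input is shorter than the kernel")
    return (input_length - kernel_size) // stride + 1


def required_input_length(output_length: int, kernel_size: int, num_layers: int) -> int:
    """Number of noise codes to sample so that `num_layers` stride-1,
    padding-less convolutions still leave `output_length` motion codes."""
    return output_length + num_layers * (kernel_size - 1)


# Kernel size 11 and 2 layers are from the text; stride 1 is an assumption.
num_noise_codes = required_input_length(output_length=16, kernel_size=11, num_layers=2)

length = num_noise_codes
for _ in range(2):
    length = valid_conv_output_length(length, kernel_size=11)

print(num_noise_codes, length)  # 36 16
```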

The training performance of VideoGPT on UCF101 is surprisingly low despite the fact that it was developed for this kind of dataset [VideoGPT]. We hypothesize that this happens due to UCF101 being a very difficult dataset and VideoGPT being trained with the batch size of 4 (a higher batch size didn’t fit our 200 GB GPU memory setup), which damaged its ability to learn the distribution.

To train our model, we also utilized the adaptive differentiable augmentations of StyleGAN2-ADA [StyleGAN2-ADA], but we found it important to make them video-consistent, i.e. to apply the same augmentation to each frame of a video. Otherwise, the discriminator starts to underperform, and the overall quality decreases. We use the default bgc augmentations pipe from StyleGAN2-ADA, which includes horizontal flips, 90-degree rotations, scaling, horizontal/vertical translations, hue/saturation/brightness/contrast changes and luma axis flipping.
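The idea of video-consistent augmentation can be sketched on toy data: the augmentation parameters are sampled once per video and then reused for every frame. This is a simplified illustration, not the StyleGAN2-ADA pipeline: frames are plain nested lists, only flips and 90-degree rotations are covered, and all function names are hypothetical.

```python
import random

Frame = list[list[int]]  # a toy grayscale frame: rows of pixel values


def sample_aug_params(rng: random.Random) -> dict:
    """Draw one set of augmentation parameters (toy subset: flip + 90-degree rotations)."""
    return {"hflip": rng.random() < 0.5, "num_rot90": rng.randrange(4)}


def apply_aug(frame: Frame, params: dict) -> Frame:
    """Apply the given augmentation parameters to a single frame."""
    out = [row[:] for row in frame]
    if params["hflip"]:
        out = [row[::-1] for row in out]
    for _ in range(params["num_rot90"]):
        # Rotate by 90 degrees clockwise: reverse the rows, then transpose.
        out = [list(col) for col in zip(*out[::-1])]
    return out


def augment_video(video: list[Frame], rng: random.Random) -> list[Frame]:
    """Video-consistent augmentation: sample parameters once, reuse them for every frame."""
    params = sample_aug_params(rng)
    return [apply_aug(frame, params) for frame in video]


video = [[[1, 2], [3, 4]] for _ in range(5)]  # 5 identical 2x2 frames
augmented = augment_video(video, random.Random(0))
# Identical input frames stay identical because they share one set of parameters.
print(all(frame == augmented[0] for frame in augmented))  # True
```

Sampling `params` inside the per-frame loop instead would give each frame its own transform, which is the inconsistent variant the text reports as harmful.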

While training the model, for real videos we first select a video index and then select a random clip (i.e., a clip with a random offset). This differs from the traditional DIGAN or VideoGPT training scheme, which is why we needed to change their data loaders to make them learn the same statistics and not get biased by very long videos.
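The difference between the two sampling schemes can be sketched as below. Both functions and the toy video lengths are hypothetical; the length-weighted variant is only our reading of “selecting clips from long videos with higher probabilities” and may differ from what DIGAN or VideoGPT actually implement.

```python
import random


def sample_clip_video_first(video_lengths: list[int], clip_len: int,
                            rng: random.Random) -> tuple[int, int]:
    """Pick a video uniformly, then a uniformly random clip offset inside it."""
    video_idx = rng.randrange(len(video_lengths))
    max_offset = video_lengths[video_idx] - clip_len
    if max_offset < 0:
        raise ValueError("video is shorter than the clip length")
    return video_idx, rng.randint(0, max_offset)


def sample_clip_length_weighted(video_lengths: list[int], clip_len: int,
                                rng: random.Random) -> tuple[int, int]:
    """Contrast: pick a video with probability proportional to its number of valid clips."""
    weights = [max(n - clip_len + 1, 0) for n in video_lengths]
    video_idx = rng.choices(range(len(video_lengths)), weights=weights, k=1)[0]
    return video_idx, rng.randint(0, video_lengths[video_idx] - clip_len)


rng = random.Random(0)
lengths = [32, 3200]  # one short and one very long video (in frames)
uniform_picks = [sample_clip_video_first(lengths, 16, rng)[0] for _ in range(2000)]
weighted_picks = [sample_clip_length_weighted(lengths, 16, rng)[0] for _ in range(2000)]

# Fraction of clips drawn from the long video under each scheme.
print(sum(uniform_picks) / 2000, sum(weighted_picks) / 2000)
```

With video-first sampling the long video contributes about half of the clips, while under length-weighted sampling it dominates almost entirely, which is the bias the text refers to.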

To develop this project, NVidia V100 32GB GPU-years + NVidia A6000 GPU-years were spent.

B.2 Projection and editing procedures

In this subsection, we describe the embedding and editing procedures, which were used to obtain results in Fig 2.

Projection. To project an existing photograph into the latent space of , we used a procedure from StyleGAN2 [StyleGAN2], but projecting into space [Image2StyleGAN] instead of , since it produces better reconstruction results and does not spoil editing properties. We set the initial learning rate to and optimized a code for LPIPS reconstruction loss [LPIPS] for 1000 steps using Adam. For motion codes, we initialized a static sequence and kept it fixed during the optimization process. We noticed that when it is also being optimized, the reconstruction becomes almost perfect, but it breaks when another sequence of motion codes is plugged in.

Editing. Our CLIP editing procedure is very similar to the one in StyleCLIP [StyleCLIP], with the exception that we embed an image assuming that it is a video frame in location . On each iteration, we resample motion codes since all our edits are semantic and do not refer to motion. We leave motion editing with CLIP for future exploration. For the sky editing video presented in Fig 2, we additionally utilize masking: we initialize a mask to cover the trees and try not to change them during the optimization using LPIPS loss. For all the videos presented on the supplementary website, no masking is used.

The details can be found in the provided source code.

B.3 Additional details on positional embeddings

Mitigating high-frequency artifacts. We noticed that if our periods are left unbounded, they might grow to very large values (up to magnitude of ), which corresponds to extra high frequencies (the period length becomes less than 4 frames) and leads to temporal aliasing. That’s why we process them via the transform: this bounds them into range with the mean of 1.0, i.e. using the at-initialization frequency scaling, which we discuss next.

Linearly spaced periods. An important design decision is the scaling of periods since at initialization it should cover both high-frequency and low-frequency details. Existing works use either exponential scaling (e.g., [NeRF, Nerfies, ALIS, SAPE]) or random scaling (e.g., [SIREN, FourierFeatures, INR-GAN, CIPS]). In practice, we scale the -th column of the amplitudes weight matrix with the value:


where we use frames and frames in all the experiments, except for SkyTimelapse, for which we use . We call this scheme linear scaling and use it as an additional tool to alleviate periodicity since it greatly increases the overall cycle of a positional embedding (see Fig 9). See also the accompanying source code for details.

(a) Exponentially spaced periods [NeRF]: , cycle length is 64.
(b) Random periods [SIREN, FourierFeatures]: , cycle length is 120 (for the depicted ).
(c) Linearly spaced periods (ours): , cycle length is 352.
(d) Raw acyclic positional embeddings : , no cyclicity. While such embeddings are acyclic, they have discontinuities at stitching points.
(e) Stitched raw acyclic positional embeddings without alignment vectors: , no cyclicity. Stitching raw positional embeddings without using “aligners” removes discontinuities, but reduces the expressive power of positional embeddings since they have zero values at time locations .
(f) Acyclic periods with linearly-spaced scaling (ours): , no cyclicity. Notice that the frequencies and phases are controlled by the motion mapping network : for example, it has the possibility to accelerate some motion (like the one represented by the red curve) by increasing its frequency.
Figure 9: Visualizing positional embeddings for different initialization strategies of period scales . The cycle length is the minimum value of for which the positional embedding vector starts repeating itself (it is computed as the least common multiple of all the individual period lengths). Existing works use either exponentially spaced or random scaling, but in our case we use the linearly spaced one since it has a very large global cycle (in contrast to exponential scaling) and is guaranteed to include high-frequency, medium-frequency and low-frequency waves (in contrast to random scaling).
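The cycle-length computation from the caption can be sketched in a few lines. The period sets below are hypothetical integers (in frames) chosen only to illustrate the effect; they are not the periods used in Fig 9 or in the released code.

```python
import math


def cycle_length(periods: list[int]) -> int:
    """Smallest t > 0 after which waves with the given integer periods all repeat together."""
    return math.lcm(*periods)


# Hypothetical period sets, not the ones from the paper.
exponential_periods = [2, 4, 8, 16, 32, 64]   # each period divides the largest one
linear_periods = [4, 6, 8, 10, 12, 14]        # evenly spaced periods

print(cycle_length(exponential_periods))  # 64
print(cycle_length(linear_periods))       # 840
```

For exponentially spaced periods every period divides the largest one, so the global cycle is just the largest period, whereas evenly spaced periods share fewer common factors and their least common multiple grows much faster.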

Another benefit of using our positional embeddings over LSTM is that they are “always stable”, i.e. they are always in a suitable range.

Appendix C Evaluation details

For the practical implementation, see the provided source code:

In this section, we describe the difficulties of a fair comparison of the FVD score. There are discrepancies between papers in computing even FID [FID_evaluation]. So it is not surprising that FVD computations for videos diverge even more, with even larger implications for the evaluation of methods.

First, we note that the I3D model [i3d] has different weights on tf.hub (the model used in the official FVD repo) compared to its official release in the official GitHub repo. That’s why we manually exported the weights from tf.hub and used a GitHub repo to obtain an exact implementation in PyTorch.

There are several issues with the FVD metric on its own. First, it does not capture motion collapse, which can be observed by comparing FVD and FVD scores between StyleGAN-V and StyleGAN-V with LSTM motion codes instead of ours: the latter has a severe motion collapse issue (see the samples on our website) and yet has similar or lower FVD scores compared to our model: 196.1 or 165.8 (depending on the distance between anchors) vs 197.0 for our model. Another issue with FVD calculation is that it is biased towards image quality. If one trains a good image generator, i.e. a model which is not able to generate any videos at all, then FVD will still be good for it despite the fact that it would have degenerate motion.

We also want to make a note on how we compute FID for video generators. For this, we generate 2048 videos of 16 frames each (starting with ) and use all those frames in the FID computation. This gives 33k images to construct the dataset, but those images will have lower diversity compared to a typically utilized 50k-sized set of images from a traditional image generator [StyleGAN]. The reason is that the 16 images in a single clip likely share a lot of content. A better strategy would be to generate 50k videos and pick a random frame from each video, but this is too heavy computationally for models which produce frames autoregressively. And using just the first frame in the FID computation would unfairly favour MoCoGAN-HD, which generates the very first frame of each video with a frozen StyleGAN2 model.

FVD is greatly influenced by 1) how many clips per video are selected; 2) with which offsets; and 3) at which frame-rate. For example, SkyTimelapse contains several extremely long videos: if we select as many clips as possible from each real video, then it will severely bias the statistics of FVD. For FaceForensics, videos often contain intro frames during their first 0.5-1.0 seconds, which will affect FVD when a constant offset of is chosen to extract a single clip per video.

That’s why we use the following protocol to compute FVD.

Computing real statistics. To compute real statistics, we select a single clip per video, chosen at a random offset. We use the actual frame-rate of the dataset which the model is being trained on, without skipping any frames. The problem with such an approach is that datasets with a small number of long videos (like FaceForensics, see Table 7) might have noisy estimates. But our results showed that the standard deviations are always even for FaceForensics . The largest standard deviation we observed was when computing FVD on RainbowJelly: on this dataset it was for VideoGPT, but it is of its overall magnitude.

Computing fake statistics. To compute fake statistics, we generate 2048 videos and save them as frames in JPEG format via the Pillow library. We use the quality parameter for doing this, since it was shown to have very close quality to PNG, but without introducing artifacts that would lead to discrepancies [FID_evaluation]. Ideally, one would like to store frames in the PNG format, but in this case it would be too expensive to represent video datasets: for example, MEAD would occupy terabytes of space in this case.

Method FF ST
Proper computation 76.82 61.95
When resized to 38.92 59.86
With jpg/png discrepancy 80.17 71.40
When using all clips per video 84.59 72.03
When using only first frames 91.64 59.74
When using subsampling of 82.88 90.21
Still real images 342.5 166.8
Table 4: Subtleties of FVD calculation. We report different ways of calculating FVD on FaceForensics (FF) and SkyTimelapse (ST) for one of our checkpoints. We show how strongly the scores of StyleGAN-V are influenced when different strategies of FVD calculation are employed. See the text for the description of each row.

We illustrate the subtleties of FVD computation in Table 4. For this, we compute real/fake statistics for our model in several different ways:

  • Resized to . Both fake and real statistics images are resized into resolution via the PyTorch bilinear interpolation (without corners alignment) before computing FVD.

  • JPG/PNG discrepancy. Instead of saving fake frames in JPG with , we use parameter in the PIL library. This creates more JPEG-like artifacts, which, for example, FID is very sensitive to.

  • Using all clips per video. We use all available -frames-long clips in each video without overlaps. Note, that our model was trained

  • Using only first frames. In each real video, instead of using random offsets to select clips, we use the first frames.

  • Using subsampling. When sampling frames for computing real/fake statistics, we select each -th frame. This is the strategy which was employed for some of the experiments in the original paper [FVD] — but in their case, authors trained the model on videos with this subsampling.

For completeness, we also provide the Inception Score [TGAN] on UCF-101 dataset in Table 5. Note that is computed by resizing all videos to spatial resolution (due to the internal structure of the C3D [c3d] model), which makes it impossible for it to capture high-resolution details of the generated videos, which is the focus of the current work.

Method Inception Score [TGAN]
MoCoGAN [MoCoGAN] 10.09 ± 0.30
MoCoGAN+SG2 (ours) 15.26 ± 0.95
VideoGPT [VideoGPT] 12.61 ± 0.33
MoCoGAN-HD [MoCoGAN-HD] 23.39 ± 1.48
DIGAN [DIGAN] 23.16 ± 1.13
StyleGAN-V (ours) 23.94 ± 0.73
Real videos 97.23 ± 0.38
Table 5: Inception Score [TGAN] on UCF101 (note that the underlying C3D model resizes the videos into resolution under the hood, eliminating high-quality details).

In Tab 6, we provide the numbers used in Fig 3. Note that StyleGAN2 training in our case is slightly slower than the officially specified one (7.7 vs 7.3 GPU-days), which we attribute to a slightly slower file system on our computational cluster.

Method FVD FID Training cost
MoCoGAN [MoCoGAN] 124.7 23.97 5
MoCoGAN+SG2 (ours) 55.62 10.82 8
VideoGPT [VideoGPT] 185.9 22.7 32
MoCoGAN-HD [MoCoGAN-HD] 111.8 7.12 16.5
DIGAN [DIGAN] 62.5 19.1 16
StyleGAN-V (ours) 47.41 9.445 8
StyleGAN2 [StyleGAN2-ADA] N/A 8.42 7.72
Table 6: FVD, FID and training costs of modern video generators on FaceForensics . Training cost is measured in terms of GPU-days.

Appendix D Failed experiments

In this section, we provide a list of ideas, which we tried to make work, but they didn’t work either because the idea itself is not good, or because we didn’t put enough experimental effort into investigating it.

Hierarchical motion codes. We tried having several layers of motion codes. Each layer has its own distance between the codes. In this way, high-level codes should capture high-level motion and bottom-level codes should represent short local motion patterns. This didn’t improve the scores and didn’t provide any disentanglement of motion information. We believe that the motion should be represented differently (similar to FOMM [FOMM]), rather than with motion codes, because they make it difficult for to make them temporally coherent.

Maximizing entropy of motion codes to alleviate motion collapse. As an additional tool to alleviate motion collapse, we tried to maximize entropy of wave parameters of our motion codes. The generator solved the task of maximizing the entropy well, but it didn’t affect the motion collapse: it managed to save some coordination dimensions of specifically to synchronize motions.

Progressive growing of frequencies in positional embeddings. We tried starting with low frequencies first and progressively opening new ones during the training. It is a popular strategy for training implicit neural representations on reconstruction tasks (e.g., [Nerfies, SAPE]), but in our case we found the following problem with it. The generator learned to use low frequencies for representing high-frequency motion and didn’t learn to utilize high frequencies for this task when they became available. That’s why high-frequency motion patterns (like blinking or speaking) were unnaturally slow.

Continuous LSTM with EMA states. Our motion codes use sine/cosine activations, which makes them suffer from periodic artifacts (those artifacts are mitigated by our parametrization, but are still present sometimes). We tried to use LSTM, but with an exponential moving average on top of its hidden states to smooth out motion representations temporally. However (likely due to the lack of experimental effort which we invested into this direction), the resulting motion representations were either too smooth or too sharp (depending on the EMA window size), which resulted in unnatural motions.
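The smoothing step of this experiment can be sketched as an exponential moving average over a sequence of hidden-state vectors. This is a generic EMA, not the exact variant we tried: the function name is hypothetical, the states are plain lists, and the strength is expressed through a `decay` coefficient as an assumption, whereas the text speaks of an EMA window size.

```python
def ema_smooth(states: list[list[float]], decay: float) -> list[list[float]]:
    """Exponential moving average over a sequence of hidden-state vectors.

    `decay` close to 1 gives very smooth outputs, `decay` = 0 returns the input unchanged.
    """
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must be in [0, 1)")
    smoothed: list[list[float]] = []
    for state in states:
        if not smoothed:
            current = list(state)  # the first state is kept as is
        else:
            previous = smoothed[-1]
            current = [decay * p + (1.0 - decay) * s for p, s in zip(previous, state)]
        smoothed.append(current)
    return smoothed


# A toy 1-dimensional hidden state that jumps abruptly from 0 to 1.
raw_states = [[0.0], [0.0], [1.0], [1.0]]
print(ema_smooth(raw_states, decay=0.5))  # [[0.0], [0.0], [0.5], [0.75]]
```

The trade-off described in the text is visible here: a large `decay` washes out fast motion, while a small one leaves abrupt transitions almost untouched.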

Concatenating spatial coordinates. INR-GAN [INR-GAN] uses spatial positional embeddings and shows that they provide better geometric prior to the model. We tried to use them as well in our experiments, but they didn’t provide any improvement either qualitatively or quantitatively, and made the training slightly slower (by 10%) due to the increased channel dimensionalities.

Feature differences in . Another experiment direction which we tried is computing differences between activations of next/previous frames in a video and concatenating this information back to the activations tensor. The intuition was to provide information with some sort of “latent” optical flow information. However, it made too powerful (its loss became smaller than usual) and it started to outpace too much, which decreased the final scores.

Predicting instead of conditioning in . There are two ways to utilize the time information in : as a conditioning signal or as a learning signal. For the latter one, we tried to predict the time distances between frames by training an additional head to predict the class (we treated the problem as classification instead of regression since there is a very limited amount of time distances between frames which sees during its training). However, it noticeably decreased the scores.

Conditioning on video length. For unconditional UCF-101, it might be very important for to know the video length in advance, because some classes might contain very short clips (like jumping), while others are very long, and it might be useful for to know in advance which video it will need to generate (since we sample frames at random time locations during training). However, utilizing this conditioning didn’t influence the scores.

Appendix E Datasets details

E.1 Datasets details

We provide the dataset statistics in Fig 10 and their comparison in Table 7. Note, that for MEAD, we use only its front camera shots (originally, it releases shots from several camera positions).

(a) FaceForensics [FaceForensics_dataset].
(b) SkyTimelapse [SkyTimelapse_dataset].
(c) UCF-101 [UCF101_dataset].
(d) RainbowJelly.
(e) MEAD [MEAD_dataset].
Figure 10: Distribution of video lengths (in terms of numbers of frames) for different datasets. Note that RainbowJelly and MEAD [MEAD_dataset] are 30 FPS, while the rest are 25 FPS datasets. Note that SkyTimelapse contains several very long videos which might bias the distribution if not treated properly.
Dataset #hours avg len FPS #speakers
FaceForensics [FaceForensics_dataset] 4.04 20.7s 25
SkyTimelapse [SkyTimelapse_dataset] 12.99 22.1s 25 N/A
UCF-101 [UCF101_dataset] 0.51 6.8s 25 N/A
RainbowJelly 7.99 17.1s 30 N/A
MEAD [MEAD_dataset] 36.11 4.3s 30
Table 7: Additional datasets information in terms of total lengths (in the total number of hours), average video length (in seconds), frame rate and the amount of speakers (for FaceForensics and MEAD).

E.2 Rainbow Jelly

For our RainbowJelly benchmark, we used the following film: It is an 8-hour-long movie of jellyfish in 4K resolution and 30 FPS from the Hoccori Japan YouTube channel. We cannot release this dataset due to copyright restrictions, but we released a full script which processes it (see the provided source code). To construct a benchmark, we sliced it into 1686 chunks of 512 frames each, starting with the 150-th frame (to remove the loading screen), center-cropped and resized into resolution. This benchmark is advantageous compared to the existing ones in the following ways:

  1. It contains complex hierarchical motions:

    • a jellyfish flowing in a particular direction (low-frequency global motion);

    • a jellyfish pushing water with its arms (medium-frequency motion);

    • small perturbations of jellyfish’s body and tentacles (high-frequency local motion).

  2. It is a very high-quality dataset (4K resolution).

  3. It is simple in terms of content, which makes the benchmark more focused on motions.

  4. It contains long videos.

Appendix F Implicit assumptions of sparse training

In this section, we elaborate on our simple theoretical exposition from Sec 3.3.

Consider that we want to fit a probabilistic model $q_\theta(x)$ to the real data distribution $p(x)$. For simplicity, we will be considering a discrete finite case, i.e. $x = (x_1, \dots, x_n)$, but note that videos, while continuous and infinite in theory, are still discretized and have a time limit to fit on a computer in practice. For fitting the distribution, we use $k$-sparse training, i.e. picking only $k$ random coordinates from each sample $x \sim p(x)$ during the optimization process. In other words, introducing $k$-sparse sampling reformulates the problem from

$$\min_\theta D\big(p(x), q_\theta(x)\big) \qquad (7)$$

to

$$\min_\theta \sum_{I \in \mathcal{I}_k} D\big(p(x_I), q_\theta(x_I)\big), \qquad (8)$$

where $D$ is a problem-specific distance function between probability distributions, $\mathcal{I}_k$ is the collection of all possible sets of $k$ unique indices $I = \{i_1, \dots, i_k\} \subseteq \{1, \dots, n\}$, and $x_I$ denotes the sub-vector $(x_{i_1}, \dots, x_{i_k})$ of $x$. This means that instead of bridging together full distributions, we choose to bridge all their possible marginals of length $k$. When will solving Eq. (8) help us obtain the full joint distribution $p(x)$? To investigate this question, we develop the following simple statement.

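Let’s denote by $\mathcal{I}^{i}_{<k}$ the collection of sets $I$ of fewer than $k$ indices s.t. we have $j < i$ for all $j \in I$.

Using the chain rule, we can represent $p(x)$ as

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i}), \qquad (9)$$

where $x_{<i}$ denotes the sequence $(x_1, \dots, x_{i-1})$. Now, if we know that for each $i$, there exists $I_i \in \mathcal{I}^{i}_{<k}$ (i.e. $I_i \subseteq \{1, \dots, i-1\}$ with $|I_i| < k$) s.t.:

$$p(x_i \mid x_{<i}) = p(x_i \mid x_{I_i}), \qquad (10)$$

then $p(x)$ is obviously simplified to:

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_{I_i}). \qquad (11)$$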

Does this tell anything useful? Surprisingly, yes. It says that if $p(x)$ is simple enough that, instead of using the whole history $x_{<i}$ to model $x_i$, it’s enough to use only some set of “representative moments” $x_{I_i}$ (unique for each $i$) with the size $|I_i| < k$, then $k$-sparse training is a viable alternative. After fitting $q_\theta$ via $k$-sparse training, we will be able to obtain $p(x)$ using Eq. (10) even though $k < n$! Note that one can obtain a conditional distribution $p(x_i \mid x_I)$ from the marginal one $p(x_i, x_I)$ for some set of indices $I$ via:

$$p(x_i \mid x_I) = \frac{p(x_i, x_I)}{\int p(x_i, x_I)\, dx_i}. \qquad (12)$$

But we would also like to have the “reverse” dependency, i.e. knowing that if we can approximate the distribution via a set of marginals, then this distribution is not too difficult. For this claim, we will need to consider marginals not of an arbitrary form $p(x_I)$, but of the form $p(x_i, x_{I_i})$, and we would need exactly $n$ of those. The reverse implication is the following. If $p(x)$ can be represented as a product of conditionals $\prod_{i=1}^{n} p(x_i \mid x_{I_i})$ with $I_i \subseteq \{1, \dots, i-1\}$, then for each $i$ there exists $I_i$ s.t. $p(x_i \mid x_{<i}) = p(x_i \mid x_{I_i})$.

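This statement, just like the previous one, looks obvious, but, oddly, it requires more than a single sentence to prove. First, we are given that:

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i}) = \prod_{i=1}^{n} p(x_i \mid x_{I_i}), \qquad (13)$$

but unfortunately, we cannot directly claim that each term in the product $\prod_i p(x_i \mid x_{<i})$ equals its corresponding one in the product $\prod_i p(x_i \mid x_{I_i})$. For this, we first need to show that for each $m \le n$ we have:

$$p(x_{\le m}) = \prod_{i=1}^{m} p(x_i \mid x_{I_i}). \qquad (14)$$

It can be seen from the fact that each trailing factor $p(x_j \mid x_{I_j})$ integrates to one over $x_j$, so the coordinates $x_n, x_{n-1}, \dots, x_{m+1}$ can be integrated out one at a time:

$$p(x_{\le m}) = \int p(x)\, dx_{m+1} \cdots dx_n = \int \prod_{i=1}^{n} p(x_i \mid x_{I_i})\, dx_{m+1} \cdots dx_n = \prod_{i=1}^{m} p(x_i \mid x_{I_i}). \qquad (15)$$

This allows us to cancel terms in the chain rule one by one, starting from the end, leading to the desired equality:

$$p(x_m \mid x_{<m}) = \frac{p(x_{\le m})}{p(x_{<m})} = \frac{\prod_{i=1}^{m} p(x_i \mid x_{I_i})}{\prod_{i=1}^{m-1} p(x_i \mid x_{I_i})} = p(x_m \mid x_{I_m}). \qquad (16)$$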

Does this reverse claim tell us anything useful? Surprisingly again, yes. It implies that if we managed to fit $p(x)$ by using $k$-sparse training, then this distribution is not sophisticated.

Merging the above two statements together, we see that $p(x)$ can be represented as a product of conditionals $p(x_i \mid x_{I_i})$ for $|I_i| < k$ if and only if for all $i$ there exists $I_i$ with $|I_i| < k$ s.t. $p(x_i \mid x_{<i}) = p(x_i \mid x_{I_i})$.

What does this statement tell us about video synthesis? Any video synthesis algorithm utilizes $k$-sparse training to learn its underlying model, but in contrast to prior work, we use very small values of $k$. This means that we fit our model to match any $k$-marginals of $p(x)$ (considering that we pick frames uniformly at random) instead of the full distribution $p(x)$. And using the above statement, such a setup implies the assumption of Eq. (10). This equation says that one can know everything about $x_i$ by just observing previous frames $x_{I_i}$. In other words, $x_i$ must be predictable from $x_{I_i}$. Moreover, it is easy to show that our statement can generalize to include several $I_i$ for the $i$-th frame, i.e. there might exist several explainable sets of frames.
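To make the statement above concrete, here is a small numerical sketch on toy binary "videos". It is an illustration under stated assumptions, not part of the original analysis: the helper names (`marginal`, `rebuild_from_pairs`, `markov_joint`, `xor_joint`) are hypothetical, the sparsity level is fixed to 2 frames, and each frame is only allowed to condition on the frame right before it. A first-order Markov chain satisfies the assumption and is recovered exactly from its pairwise marginals, while an XOR-style distribution (the third frame depends jointly on the first two) is not.

```python
import itertools
import numpy as np

def marginal(joint, indices):
    """Marginal over the given coordinates, keyed by their values."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[j] for j in indices)
        out[key] = out.get(key, 0.0) + p
    return out

def rebuild_from_pairs(joint, n):
    """Rebuild p(x) as p(x_1) * prod_i p(x_i | x_{i-1}) from 2-marginals only."""
    pairs = [marginal(joint, (i - 1, i)) for i in range(1, n)]
    rebuilt = {}
    for x in joint:
        # p(x_1) is itself a marginal of the first pair.
        p = sum(pairs[0][(x[0], b)] for b in (0, 1))
        for i in range(1, n):
            pair = pairs[i - 1]
            prev = sum(pair[(x[i - 1], b)] for b in (0, 1))  # p(x_{i-1})
            p *= pair[(x[i - 1], x[i])] / prev               # p(x_i | x_{i-1})
        rebuilt[x] = p
    return rebuilt

def markov_joint(n, rng):
    """Random first-order Markov chain over n binary variables."""
    init = rng.dirichlet(np.ones(2))
    trans = rng.dirichlet(np.ones(2), size=(n - 1, 2))  # trans[i, a, b]
    joint = {}
    for x in itertools.product((0, 1), repeat=n):
        p = init[x[0]]
        for i in range(1, n):
            p *= trans[i - 1, x[i - 1], x[i]]
        joint[x] = float(p)
    return joint

def xor_joint():
    """x_1, x_2 are fair coins and x_3 = x_1 XOR x_2 (not first-order Markov)."""
    return {
        x: (0.25 if x[2] == x[0] ^ x[1] else 0.0)
        for x in itertools.product((0, 1), repeat=3)
    }

def max_error(joint, n):
    rebuilt = rebuild_from_pairs(joint, n)
    return max(abs(joint[x] - rebuilt[x]) for x in joint)

markov_err = max_error(markov_joint(4, np.random.default_rng(0)), 4)
xor_err = max_error(xor_joint(), 3)
print(f"Markov chain error: {markov_err:.2e}")  # numerically zero
print(f"XOR error:          {xor_err:.3f}")     # 0.125
```

In the XOR case every pairwise marginal is uniform, so the reconstruction collapses to the uniform distribution; conditioning the third frame on both previous ones (which needs 3-frame marginals) would recover it.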

Appendix G Additional samples

For the ease of visualization, we provide additional samples of the model via a web page:

Appendix H Comparison to DIGAN

Our model shares a lot of similarities with DIGAN [DIGAN], and in this section we highlight those similarities and differences.

h.1 Major similarities

Sparse training. DIGAN also utilizes very sparse training (only 2 frames per video). But in our case, we additionally explore the optimal number of frames per video (see Sec 3.3).

Continuous-time generator. DIGAN also builds a generator, which is continuous in time. But our generator does not lose the quality at infinitely large lengths.

Dropping conv3d blocks. DIGAN also drops conv3d blocks in their discriminator. But in contrast to us, they still have 2 discriminators.

h.2 Major differences

Motion representation. DIGAN uses only a single global motion code, which makes it theoretically impossible to generate infinite videos: at some point it will start repeating itself (due to the usage of sine/cosine-based positional embeddings). In our case, we use an infinite sequence of motion codes, which are temporally interpolated, used to compute wave parameters, and transformed into motion codes. DIGAN also mixes temporal and spatial information together into the same positional embedding, which creates the following problem: when only time changes, the spatial location perceived by the model also changes. This creates a “head-flying-away” effect (see the samples). In our case, we keep these two information sources decomposed from one another.

Generator’s backbone. DIGAN is built on top of INR-GAN [INR-GAN], while our work uses StyleGAN2. This allows DIGAN to inherit INR-GAN’s benefits from being spatially continuous, but at the expense of being less stable and being slower to train (due to the lack of mixed precision and increased channel dimensionalities from concatenating positional embeddings).

Discriminator structure. DIGAN uses two discriminators: the first one operates on the image level and is equivalent to StyleGAN2’s one, while the other one operates on the “video” level: it takes frames and the time differences between them, concatenates them all together into a 7-channel input image (tiling the time difference scalar), and passes it into a model with StyleGAN2 discriminator’s backbone. In our case, we utilize a single hypernetwork-based discriminator.
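The 7-channel input mentioned above can be pictured with a short sketch. This is not DIGAN's actual code: the function name `make_pair_input` is hypothetical, and the sketch assumes that the 7 channels come from two RGB frames (3 + 3 channels) plus one channel holding the tiled scalar time difference, in channel-first layout.

```python
import numpy as np

def make_pair_input(frame_a, frame_b, dt):
    """Stack two RGB frames and a tiled time difference into a (7, H, W) array."""
    if frame_a.shape != frame_b.shape or frame_a.shape[0] != 3:
        raise ValueError("expected two (3, H, W) frames of equal size")
    _, height, width = frame_a.shape
    # Tile the scalar time difference into a constant 1-channel plane.
    dt_plane = np.full((1, height, width), dt, dtype=frame_a.dtype)
    return np.concatenate([frame_a, frame_b, dt_plane], axis=0)

rng = np.random.default_rng(0)
frame_a = rng.random((3, 8, 8), dtype=np.float32)
frame_b = rng.random((3, 8, 8), dtype=np.float32)
pair_input = make_pair_input(frame_a, frame_b, dt=0.25)
print(pair_input.shape)  # (7, 8, 8)
```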

Sampling procedure. We use samples per video, while DIGAN uses . Also, we sample frames uniformly randomly, while DIGAN selects and (in this way, DIGAN sometimes have ). Apart from that, they use .
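As an illustration of sampling frames uniformly at random, the sketch below draws a fixed number of distinct frame indices from a clip. The function name `sample_frame_indices` is hypothetical, the values `video_len=512` and `k=3` are placeholders chosen only for the example, and any constraint on the distance between sampled frames is deliberately left out; the released codebase is the reference for the actual procedure.

```python
import numpy as np

def sample_frame_indices(video_len, k, rng):
    """Pick k distinct frame indices uniformly at random, in temporal order."""
    if not 0 < k <= video_len:
        raise ValueError("k must be between 1 and the video length")
    return np.sort(rng.choice(video_len, size=k, replace=False))

rng = np.random.default_rng(0)
idx = sample_frame_indices(video_len=512, k=3, rng=rng)  # placeholder values
print(idx)
```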

Apart from those major distinctions, there are a lot of small implementation differences. We refer the interested reader to the released codebases for them:

h.3 A note on the computational cost

INR-GAN demonstrated that it has higher throughput than StyleGAN2 in terms of images/second [INR-GAN]. But the authors compare to the original StyleGAN2 implementation and not to the one from the StyleGAN2-ADA repo, which is much better optimized. Also, they use caching of positional embeddings, which is only possible at test time and has a great influence on computational performance. In this way, we found that StyleGAN2 is times faster to train and is less consuming in terms of GPU memory than INR-GAN.

DIGAN is built on top of INR-GAN and therefore suffers from the issues described above. We trained it for a week on NVIDIA V100 GPUs and observed that it stopped improving after days of training. This is equivalent to real frames seen by the discriminator (while MoCoGAN+SG2 and StyleGAN-V reach in just 2 days for the same resolution in the same environment). At the time of submitting the main paper, there was no information about its training cost. However, the authors updated their manuscript by the time of submitting the supplementary and specify a training cost of 8 GPU-days for resolution, which is consistent with our experiments (considering that we use twice as large a resolution).