StyleGAN-V
Videos show continuous events, yet most, if not all, video synthesis frameworks treat them discretely in time. In this work, we think of videos as what they should be: time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image + video discriminators pair and propose to use a single hypernetwork-based one. This decreases the training cost and provides a richer learning signal to the generator, making it possible to train directly on 1024^2 videos for the first time. We build our model on top of StyleGAN2 and it is just ≈5% costlier to train at the same resolution, while achieving almost the same image quality. Moreover, our latent space features similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at an arbitrarily high frame rate, while prior work struggles to generate even 64 frames at a fixed rate. Our model achieves state-of-the-art results on four modern 256^2 video synthesis benchmarks and one 1024^2 resolution one. Videos and the source code are available at the project website: https://universome.github.io/stylegan-v.
Recent advances in deep learning pushed image generation to unprecedented photorealistic quality [StyleGAN2ADA, BigGAN] and spawned many industrial applications. Video generation, however, does not enjoy similar success and struggles to fit complex real-world datasets. The difficulties are caused not only by the more complex nature of the underlying data distribution, but also by the computationally intensive video representations employed by modern generators. They treat videos as discrete sequences of images, which is very demanding for representing long high-resolution videos and induces the use of expensive conv3d-based architectures to model them [TGAN, MoCoGAN, TGANv2, DVD_GAN] (e.g., DVD-GAN [DVD_GAN] costs tens of thousands of dollars to train at high resolution, as reported by [MoCoGANHD]). In this work, we argue that this design choice is not optimal and propose to treat videos in their natural form: as continuous signals x(t) that map any time coordinate t into an image frame. Consequently, we develop a GAN-based continuous video synthesis framework by extending the recent paradigm of neural representations [NeRF, SIREN, FourierFeatures] to the video generation domain.
Developing such a framework comes with three challenges. First, sine/cosine positional embeddings are periodic by design and depend only on the input coordinates. This does not suit video generation, where temporal information should be aperiodic (otherwise, videos would loop) and different for different samples. Next, since videos are perceived as infinite continuous signals, one needs to develop an appropriate sampling scheme to use them in a practical framework. Finally, one needs to redesign the discriminator accordingly to operate in the new sampling pipeline.
To solve the first issue, we develop positional embeddings with time-varying wave parameters which depend on motion information, sampled uniquely for each video. This motion information is represented as a sequence of motion codes produced by a padding-less conv1d-based model. We prefer it over the usual LSTM network [MoCoGAN, MoCoGANHD, TGANv2, VideoGLO] to alleviate the RNN's instability when unrolled to large depths and to produce frames non-autoregressively.
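The padding-less conv1d idea can be sketched in a few lines of numpy (the layer count, kernel size and dimensions here are illustrative, not the paper's):

```python
import numpy as np

def conv1d_no_pad(x, w):
    """Valid (padding-less) 1D convolution over a token sequence.
    x: (seq_len, dim_in), w: (kernel_size, dim_in, dim_out)."""
    k = w.shape[0]
    out_len = x.shape[0] - k + 1
    return np.stack([np.einsum('kd,kde->e', x[i:i + k], w)
                     for i in range(out_len)])

rng = np.random.default_rng(0)
noise = rng.standard_normal((16, 8))       # equidistant motion noise tokens
w1 = 0.1 * rng.standard_normal((5, 8, 8))  # large kernel size, as in the text
w2 = 0.1 * rng.standard_normal((5, 8, 8))
codes = conv1d_no_pad(np.tanh(conv1d_no_pad(noise, w1)), w2)
# Each valid conv with kernel 5 shrinks the sequence by 4 tokens: 16 -> 12 -> 8
assert codes.shape == (8, 8)
```

Because each valid convolution shrinks the sequence, the sampled noise sequence must be slightly longer than the time range it has to cover, and every output code depends only on a bounded window of noise tokens, which is what permits non-autoregressive generation.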
Next, we investigate the question of how many samples are needed to learn a meaningful video generator. We argue that it can be learned from extremely sparse videos (as few as 2 frames per clip), and justify it with a simple theoretical exposition in Sec 3.3 and practical experiments (see Table 2).
Finally, since our model sees only 2–4 randomly sampled frames per video, it is highly redundant to use expensive conv3d blocks in the discriminator, which are designed to operate on long sequences of equidistant frames. That is why we replace it with a conv2d-based model, which aggregates information temporally via simple concatenation and is conditioned on the time distances between its input frames. We use hypernetwork-based modulation [Hypernetworks] for this conditioning to make the discriminator more flexible in processing frames sampled at varying time distances. Such a redesign improves training efficiency (see Table 1), provides a more informative gradient signal to the generator (see Fig 4) and simplifies the overall pipeline (see Sec 3.2), since we no longer need two different discriminators operating on the image and video levels separately, as modern video synthesis models do (e.g., [MoCoGAN, TGANv2, DVD_GAN]).
We build our model, named StyleGAN-V, on top of the image-based StyleGAN2 [StyleGAN2]. It is able to produce arbitrarily long videos at an arbitrarily high frame rate in a non-autoregressive manner and enjoys great training efficiency: it is only ≈5% costlier than the classical image-based StyleGAN2 model [StyleGAN2], while having only slightly worse plain image quality in terms of FID [FID] (see Fig 3). This allows us to easily scale it to HQ datasets, and we demonstrate that it is directly trainable at 1024^2 resolution.
For empirical evaluation, we use 5 benchmarks: FaceForensics [FaceForensics_dataset], SkyTimelapse [SkyTimelapse_dataset], UCF101 [UCF101_dataset], RainbowJelly (introduced in our work) and MEAD [MEAD_dataset]. Apart from our model, we train 5 different methods from scratch and measure their performance using the same evaluation protocol. Fréchet Video Distance (FVD) [FVD] serves as the main metric for video synthesis, but there is no complete official implementation for it (see Sec 4 and Appx C). This leads to discrepancies in the evaluation procedures used by different works, because FVD, similarly to FID [FID], is very sensitive to the data format and sampling strategy [FID_evaluation]. That is why we implement, document and release our complete FVD evaluation protocol. In terms of sheer metrics, our method performs better on average than the closest runner-up.

Video synthesis. Early works on video synthesis mainly focused on video prediction [SiftFlow, UnsupervisedVisualPrediction], i.e. generating future frames given a sequence of previously seen ones. Early approaches for this problem typically employed recurrent convolutional models trained with a reconstruction objective [video_language_modeling, Robot_pushing_dataset, LSTMs_video_representations], but later adversarial losses were introduced to improve the synthesis quality [MultiScaleVideoPrediction, Video2Video, GeneratingVideosWithSceneDynamics]. Some recent works explore autoregressive video prediction with recurrent or attention-based models (e.g., [VideoTransformer, LVT, VideoGPT, PredictingVideoWithVQVAE, VideoPixelNetworks]). Another close line of research is video interpolation, i.e. increasing the frame rate of a given video (e.g., [Video_interpolation_ASC, Video_interpolation_DAIN, Video_interpolation_SuperSloMo]). In our work, we study video generation, which is a more challenging problem than video prediction, since it seeks to synthesize videos from scratch, i.e. without the expressive conditioning on previous frames. Classical methods in this direction are typically based on GANs [GANs]. MoCoGAN [MoCoGAN] and TGAN [TGAN] decompose the generator's input noise into a content code and motion codes, which became a standard strategy for many subsequent works (e.g., [MoCoGANHD, TGANv2, VideoGLO, TemporalShiftGAN]). SVGAN [SelfSupervisedVideoGANs] additionally adds self-supervision losses to improve the synthesis. MoCoGAN-HD [MoCoGANHD] and StyleVideoGAN [StyleVideoGAN], similar to us, consider high-resolution video synthesis. But in their case, the authors perform indirect training: they train a motion-code generator in the latent space of a pretrained StyleGAN2 model. StyleGAN-V is trained on extremely sparse videos. This makes it related to [TGANv2, Hierarchical_video_generation, Inmodegan], which use a pyramid of discriminators operating on different temporal resolutions (with large temporal subsampling factors). In contrast to the prior work, our generator is continuous in time. In this way, it is similar to VidODE [VidODE]: a continuous-time video interpolation and prediction model based on neural ODEs [NODE].
To the best of our knowledge, all modern video synthesis approaches utilize expensive conv3d blocks in their decoder and/or encoder components (e.g., [MoCoGAN, MoCoGANHD, TGANv2, TemporalShiftGAN, DVD_GAN, LDVDGAN, ProVGAN]). Often, GAN-based approaches utilize two discriminators, operating on the image and video levels independently, where the video discriminator works at a low resolution to save computation (e.g., [MoCoGAN, G3AN, MoCoGANHD, DVD_GAN]). In our work, we show that it is enough to use a single holistic hypernetwork-based [Hypernetworks] discriminator conditioned on the time differences between frames to build a state-of-the-art temporally coherent video generator. For conditioning, we use ModConv2d blocks from [StyleGAN2], which is similar to AnyCostGAN [AnyCostGAN].
Neural representations. Neural representations is a recent paradigm that uses neural networks to represent continuous signals, such as images, videos, audio, 3D objects and scenes (e.g., [NeRF, SIREN, FourierFeatures, SRNs, TemplateImplicitFunction]). It is most popular for 3D reconstruction and geometry processing tasks (e.g., [DeepSDF, DeepMeta, OccupancyNetworks, ConvolutionalOccupancyNetworks, DVR]), including video-based reconstruction [Nerfies, SpaceTimeNeIF, Dnerf, NSFF]. Several recent projects explored building generative models over such representations to synthesize images (e.g., [INRGAN, CIPS, ALIS]), 3D objects (e.g., [GRAF, piGAN, NeRFVAE]) or multi-modal signals (e.g., [INRs_distribution, GEM]), and our work extends this line of research to video generation.

Concurrent works. The development of neural-representation-based approaches moves extremely fast, and there are two concurrent works which propose ideas similar to ours. DIGAN [DIGAN] is a concurrent project that explores the same direction of using neural representations for continuous video synthesis and shares a lot of ideas with our work. The authors also consider a continuous-time generator, trained by a discriminator without conv3d layers. The core difference with our work is that they use a different parametrization of motions and a dual discriminator: one operates on frame pairs and the second one on individual images. We enumerate the differences and similarities in Appx H. NeRV [NeRV] is another concurrent project, which proposes to represent videos as convolutional neural representations. But in their case, the authors explore compression and denoising tasks. GEM [GEM] utilizes generative latent optimization [GLO] to build a multi-modal generative model.
Our model is based on the paradigm of neural representations [NeRF, SIREN, FourierFeatures], i.e. representing signals as neural networks. We treat each video as a function x(t) which is continuous in time t ≥ 0. In this manner, the training dataset is a set of subsampled signals {x_i(t_1), x_i(t_2), ..., x_i(t_{n_i})} for i = 1, ..., N, where N denotes the total number of videos, t_j denotes the time position of the j-th frame and n_i is the number of frames in the i-th video (to simplify the notation, we assume that all videos have the same frame rate and were sampled starting at t = 0, but this is not a limitation of the method). Note that each video might have a different length, and in practice these lengths vary a lot (see Appx E for dataset statistics). Our goal is to train a generative model over video signals, having only their subsampled versions. To achieve this, we develop the following framework.
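As a toy illustration of this data model (the signal and all shapes below are made up for the example), a "video" is just a function of t, and a training clip is a finite set of (timestamp, frame) pairs:

```python
import numpy as np

def toy_video(t, h=4, w=4):
    """A time-continuous 'video': any real t >= 0 maps to an h*w frame."""
    yy, xx = np.mgrid[0:h, 0:w]
    return np.sin(0.3 * t + 0.5 * (xx + yy))

def subsample(video_fn, timesteps):
    """The discrete training view of a continuous signal."""
    return [(t, video_fn(t)) for t in timesteps]

clip = subsample(toy_video, [0.0, 7.0, 31.5])  # frames need not be equidistant
assert len(clip) == 3
assert clip[2][1].shape == (4, 4)
```

The framework only ever observes such finite subsamples, while the object being modeled remains the continuous signal itself.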
We build the model on top of StyleGAN2 [StyleGAN2ADA] and redesign its generator and discriminator networks for video synthesis. Our generator is conceptually similar to MoCoGAN [MoCoGAN], i.e. we separate the latent information into a content code and a motion trajectory. In contrast to MoCoGAN, our motion codes are continuous in time, and we describe their design in Sec 3.1. The only modification we make on top of StyleGAN2's generator is the concatenation of our continuous motion codes to its constant input tensor. In all other aspects, it is entirely equivalent to its image-based counterpart.

The discriminator D takes k frames of a sparsely sampled video, independently extracts features from them, concatenates those features channel-wise into a global video descriptor and predicts the real/fake class from it. D is conditioned on the time distances between the frames, and we use hypernetwork-based conditioning to input this information.
Overview. The generator consists of three components: a content mapping network, a motion mapping network and a synthesis network. The content mapping and synthesis networks are equivalent to their StyleGAN2 counterparts, with the exception that we tile and concatenate the motion codes to the constant input tensor of the synthesis network.
A video is generated in the following way. First, we sample the content noise and, following StyleGAN2, transform it into a latent code w, which is shared for all timesteps of a video. Then, to generate a frame at a specified time location t, we compute its motion code in three steps. First, we sample a discrete sequence of equidistant trajectory noise vectors, positioned at distance δ from one another. The number of tokens is determined by the condition that the sequence must be long enough to cover the desired timestep t (in practice, since the motion mapping network uses padding-less convolutions, this sequence is slightly larger; we elaborate on this in Appx B). Then, we process it with the conv1d-based motion mapping network with a large kernel size into a sequence of motion codes v_0, v_1, .... After that, we take the pair of codes v_k, v_{k+1} between which t lies (i.e. kδ ≤ t < (k+1)δ) and compute an acyclic positional embedding from them, described next. This positional embedding serves as the motion code for our generator. In fact, we do not need to sample all the motion noise vectors to produce it, but only those that it depends on. In this way, our generator can produce frames non-autoregressively.

Acyclic positional encoding. Traditional positional embeddings [SIREN, FourierFeatures] are cyclic by default. This does not create problems in traditional applications (like image or scene representations), because the spatial domain used there never exceeds the period length [NeRF, INRGAN]. But for video generation, cyclicity is not desirable, because it makes a video loop at some point. To solve this issue, we develop the acyclic positional encoding mechanism.
A sine-based positional embedding vector can be expressed in the following form:

    PE(t) = α ⊙ sin(2πt/τ + ρ),    (1)

where ⊙ denotes elementwise vector multiplication, α, τ, ρ are the amplitudes, periods and phases of the corresponding waves, and the sine function is applied elementwise. By default, these embeddings are periodic and always the same for any input [SIREN, FourierFeatures, NeRF], which is not desirable for video synthesis, where natural videos contain different motions and are typically aperiodic. To solve this issue, we compute the wave parameters from the motion noise in the following way. First, "raw" motion codes are computed using wave parameters predicted from the motion codes v_k (δ being the distance between consecutive motion codes):

    w̃(t) = α_k ⊙ sin(2πt/τ_k + ρ_k)  for t ∈ [kδ, (k+1)δ),    (2)

where

    α_k = W_α v_k,  ρ_k = W_ρ v_k,    (3)

and W_α, W_ρ are learnable weight matrices. Using w̃(t) directly as motion codes does not lead to good results, since it contains discontinuities at the segment boundaries (see Fig 8(d)). That is why we "stitch" their start and end values via:

    w(t) = w̃(t) − lerp(w̃(kδ), w̃((k+1)δ), t) + lerp(W_a v_k, W_a v_{k+1}, t),    (4)

where W_a is a learnable weight matrix and lerp is the elementwise linear interpolation between the two boundary values using the time position t. The first subtraction in Eq (4) alters the positional embeddings to make them converge to zero values at the locations kδ. This limits the expressive power of the positional embeddings, and that is why we add the "alignment" vectors W_a v_k to restore it. See Fig 8(e) for the visualization.

In practice, we found it useful to compute the periods as:

    τ_k = c ⊙ (1 + W_τ v_k),    (5)

where 1 is a vector of ones and c is a vector of linearly-spaced scaling coefficients. See Appx B and the source code for details.
One could try using the outputs of the motion mapping network directly as motion codes instead of the stitched positional embeddings. This also eliminates cyclicity (in theory), but leads to poor results in practice: if the distance δ between motion codes is small, then the motion trajectory contains unnatural sharp transitions; and when δ is increased, the model loses its ability to properly represent high-frequency motions (like blinking), since the codes change too slowly. We empirically validate this in Tab 2 (also see the samples on the project webpage).
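The stitching mechanism can be illustrated with a small numpy sketch. It is a deliberate simplification of the actual method: the periods are fixed and linearly spaced rather than predicted as in Eq (5), the dimensions are tiny, and the weight matrices are random stand-ins for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(1)
d, dim = 8, 6                           # motion-code dim, embedding dim (illustrative)
W_alpha, W_rho, W_align = (0.3 * rng.standard_normal((dim, d)) for _ in range(3))
v = rng.standard_normal((3, d))         # motion codes at anchor times 0, T, 2T
T = 16.0                                # distance between consecutive motion codes
tau = np.linspace(4.0, 64.0, dim)       # fixed linearly-spaced periods (simplification)

def raw(t, k):
    """Per-segment sine waves with parameters predicted from motion code v[k]."""
    return (W_alpha @ v[k]) * np.sin(2 * np.pi * t / tau + W_rho @ v[k])

def acyclic_pe(t):
    """Stitched embedding: subtract the lerp of the segment's boundary values,
    then add lerp'ed alignment vectors so segments join continuously."""
    k = int(t // T)
    s = (t - k * T) / T                 # relative position inside the segment
    boundary = (1 - s) * raw(k * T, k) + s * raw((k + 1) * T, k)
    align = (1 - s) * (W_align @ v[k]) + s * (W_align @ v[k + 1])
    return raw(t, k) - boundary + align

# At an anchor, the wave part cancels and only the alignment vector remains,
# so adjacent segments meet and the embedding never repeats itself:
assert np.allclose(acyclic_pe(0.0), W_align @ v[0])
assert np.allclose(acyclic_pe(16.0 - 1e-6), acyclic_pe(16.0 + 1e-6), atol=1e-3)
```

The subtraction removes the wave values at the anchors, so each segment starts and ends at its alignment vector; since new motion codes keep arriving over time, the resulting trajectory is continuous but acyclic.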
Modern video generators typically utilize two separate discriminators, which operate on the image and video levels separately [DVD_GAN, MoCoGAN, MoCoGANHD]. But since we train on extremely sparse videos and aim for a computationally efficient model, we propose to use a holistic hypernetwork-based discriminator D, which is conditioned on the time distances between frames. It consists of two parts: 1) a feature extractor backbone, which independently embeds each image frame into a feature map; and 2) a convolutional head, which takes the concatenation of all the features and outputs the real/fake logit.

We input the time-distance information between frames into D in the following way. First, we encode the distances with a positional encoding [SIREN, FourierFeatures], preprocess them with a 2-layer MLP and concatenate the results into a single conditioning vector. After that, we use this vector to modulate the weights of the first convolutional layer in each block of the backbone and the head, and also as the conditioning vector in the projection head [ProjectionDiscriminator] in StyleGAN2's DiscrEpilogue block. The modulation in D is equivalent to the modulation in the generator, but uses the time-distance conditioning vector instead of the style vector: it is passed through a mapping network, transformed with an affine layer and multiplied with the 4D weight tensor of a convolutional layer along the input-channel axis. We do not modulate every convolutional layer out of practical considerations: in practice, ModConv2d is heavier than Conv2d. The overall architecture is visualized in Fig 6.

Such an approach is considerably more efficient than using dual image/video discriminators, since we no longer need the conv3d-based video discriminator, which is too expensive to operate on high-resolution videos [DVD_GAN]. Moreover, as we demonstrate in Fig 4, it provides a more informative learning signal to the generator.
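A stripped-down numpy sketch of this conditioning path follows (the 2-layer MLP and StyleGAN2's weight demodulation are omitted, and all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def positional_enc(deltas, dim=8):
    """Fixed sine/cosine features of the time distances between sampled frames."""
    freqs = 2.0 ** np.arange(dim // 2)
    ang = np.outer(deltas, freqs)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1).ravel()

def modulated_weight(w, cond, W_affine, b_affine):
    """Scale a conv kernel per input channel from a conditioning vector,
    in the spirit of StyleGAN2's modulated convolution (no demodulation)."""
    s = W_affine @ cond + b_affine      # one scale per input channel
    return w * s[None, :, None, None]   # w: (out_ch, in_ch, kh, kw)

deltas = np.array([7.0, 12.0])          # distances between 3 sampled frames
cond = positional_enc(deltas)           # conditioning vector (MLP step omitted)
w = rng.standard_normal((4, 3, 3, 3))
W_affine = 0.1 * rng.standard_normal((3, cond.size))
w_mod = modulated_weight(w, cond, W_affine, np.ones(3))
assert w_mod.shape == w.shape
```

The key point is that the convolution weights themselves become a function of the sampled time distances, so the same discriminator can judge clips sampled at very different temporal scales.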
Videos are continuous signals and any practical video generator relies on some sort of subsampling. The question of how many samples are necessary to train a video synthesis model is fundamental, because this design decision greatly influences the quality and training cost. In our work, we empirically show that one can train a stateoftheart video generator with as few as 2 frames per video.
Consider the problem of learning a probability distribution p(x) over vectors x ∈ R^n, and suppose that we utilize sparse training, i.e. select k coordinates of the vector x at random on each iteration of the optimization process. Then our optimization objective is equivalent to learning all possible k-dimensional marginal distributions instead of learning the joint distribution p(x). When does learning the marginals allow us to obtain the full joint distribution in the end? The following simple statement adds some clarity to this question. Let us denote by S = (S_1, ..., S_n) a collection of index sets such that S_i ⊆ {1, ..., i − 1} and |S_i| ≤ k − 1 for all i; in other words, each S_i selects at most k − 1 of the preceding coordinates. Then p(x) can be represented as a product of conditionals p(x_i | x_{S_i}), each computable from a k-dimensional marginal, if and only if there exists such an S for which p(x_i | x_1, ..., x_{i−1}) = p(x_i | x_{S_i}) for all i.
The above statement is primitive (see the proof in Appx F) but provides useful practical intuition. For video synthesis, it implies that one can learn a video generator using only k frames per video only if, for any frame, there exist at most k − 1 previous frames sufficient to properly predict it (see Appx F). And we argue that for modern video synthesis datasets, one does not need many frames to make such a prediction. For example, for SkyTimelapse [SkyTimelapse_dataset], the motions are typically unidirectional and thus easily predictable from only 2 previous frames (which corresponds to training with 3 frames per video). But surprisingly, in practice we found that using even 2 frames per clip can provide state-of-the-art performance.
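A small numeric sanity check of this intuition, assuming a toy 3-frame "video" with a binary state and first-order Markov motion: the full joint distribution is recoverable from pairwise marginals alone, i.e. from exactly what a 2-frames-per-clip learner observes:

```python
import numpy as np

# A 3-step binary Markov chain: each frame depends only on the previous one.
p1 = np.array([0.3, 0.7])                      # p(x1)
T12 = np.array([[0.9, 0.1], [0.2, 0.8]])       # p(x2 | x1)
T23 = np.array([[0.6, 0.4], [0.5, 0.5]])       # p(x3 | x2)
joint = p1[:, None, None] * T12[:, :, None] * T23[None, :, :]

# Pairwise marginals -- all that sparse training with k = 2 can ever see:
m12 = joint.sum(axis=2)                        # p(x1, x2)
m23 = joint.sum(axis=0)                        # p(x2, x3)

# The Markov structure lets us rebuild the full joint from the pairs alone:
cond = m23 / m23.sum(axis=1, keepdims=True)    # p(x3 | x2)
rebuilt = m12[:, :, None] * cond[None, :, :]
assert np.allclose(rebuilt, joint)
```

If a frame depended on two earlier frames instead of one, pairwise marginals would no longer suffice and k would have to grow accordingly, which is the content of the statement above.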
We treat videos as infinite continuous signals, but in practice one has to set a limit on the maximum time location which can be seen during training. To the best of our knowledge, previous methods limit it to relatively small values [TGANv2, Hierarchical_video_generation], while we train the model with a much larger maximum time location, which does not incur much additional computational burden due to the non-autoregressive nature of our generator. However, we set the maximum distance between the first and the last sampled frame to 32 to cover short-term and medium-term movements: otherwise, we observed unstable training and abrupt motions in the video samples. To sample frames, we first sample the distance between the first and the last frame, and then the clip offset. After that, the locations of the remaining frames are selected at random without repetitions.
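The sampling scheme can be sketched as follows (max_dist = 32 matches the text; the number of frames k and the maximum time location are illustrative values, not the paper's):

```python
import numpy as np

def sample_timesteps(rng, k=3, max_dist=32, t_max=1024):
    """Pick k frame locations: first the clip span, then its offset,
    then the interior frames, all without repetitions."""
    dist = rng.integers(k - 1, max_dist + 1)   # span between first and last frame
    start = rng.integers(0, t_max - dist)      # clip offset inside [0, t_max)
    interior = rng.choice(np.arange(start + 1, start + dist),
                          size=k - 2, replace=False)
    return np.sort(np.concatenate([[start, start + dist], interior]))

rng = np.random.default_rng(4)
ts = sample_timesteps(rng)
assert len(ts) == 3 and ts[-1] - ts[0] <= 32
assert len(set(ts.tolist())) == 3              # no repeated frame locations
```

Sampling the span first (rather than the frames independently) guarantees that every clip exercises both a short-range and a longer-range time difference for the discriminator to judge.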
Method | FaceForensics (FVD_16 / FVD_128) | SkyTimelapse (FVD_16 / FVD_128) | UCF101 (FVD_16 / FVD_128) | RainbowJelly (FVD_16 / FVD_128) | Cost (GPU-days)
MoCoGAN [MoCoGAN] | 124.7 / 257.3 | 206.6 / 575.9 | 2886.9 / 3679.0 | 1572.9 / 549.7 | 5
+ StyleGAN2 backbone | 55.62 / 309.3 | 85.88 / 272.8 | 1821.4 / 2311.3 | 638.5 / 463.0 | 8
MoCoGAN-HD [MoCoGANHD] | 111.8 / 653.0 | 164.1 / 878.1 | 1729.6 / 2606.5 | 579.1 / 628.2 | 7.5 + 9
VideoGPT [VideoGPT] | 185.9 / N/A | 222.7 / N/A | 2880.6 / N/A | 136.0 / N/A | 16 + 16
DIGAN [DIGAN] (concurrent work) | 62.5 / 1824.7 | 83.11 / 196.7 | 1630.2 / 2293.7 | 436.6 / 369.0 | 16
StyleGAN-V (ours) | 47.41 / 89.34 | 79.52 / 197.0 | 1431.0 / 1773.4 | 195.4 / 262.5 | 8
Datasets. We test our model on 5 benchmarks: FaceForensics [FaceForensics_dataset], SkyTimelapse [SkyTimelapse_dataset], UCF101 [UCF101_dataset], RainbowJelly (introduced by us) and MEAD [MEAD_dataset]. We used the train splits (when available) for all the datasets except for UCF101, which is an extremely difficult dataset, so we used its train+test splits. We noticed that modern video synthesis datasets are either too simple or too difficult in terms of content and motion, and there are no datasets "in between". That is why we introduce RainbowJelly: a dataset of floating jellyfish from the Hoccori Japan YouTube channel, which contains 8 hours of video in total. It has simple content but complex hierarchical motions, and this makes it a challenging but approachable testbed for evaluating modern video generators. We provide its details in Appx E. All the datasets share the same frame rate, except for RainbowJelly and MEAD, which have 30 FPS.
Evaluation. Following prior work, we use Fréchet Video Distance (FVD) [FVD] and Inception Score (IS) as our evaluation metrics, with FVD being the main one since, similarly to FID [FID] (its image-based counterpart), it better aligns with human-perceived quality. We use two versions of FVD: FVD_16 and FVD_128, which use 16-frame-long and 128-frame-long videos respectively to compute their statistics. Inception Score is used only to evaluate the generation quality on UCF101, since it uses a UCF101-finetuned C3D model [TGAN].

The official FVD project [FVD] does not provide a complete implementation of the evaluation pipeline, but rather an inference script for a single batch of videos, which are required to be already resized and loaded into memory. This creates discrepancies in the evaluation protocols used by previous works, since FVD (similarly to FID [FID_evaluation]) is very sensitive to the interpolation procedures and to perceptually unnoticeable artifacts introduced by data processing, like JPEG compression. We also found it to be very sensitive to how one extracts clips from real videos to compute the statistics. We implement and release a complete FVD evaluation protocol and use it to evaluate all the methods for a fair comparison. It is documented in Appx C.
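While the full protocol involves an I3D feature extractor and careful clip extraction, the final step of FVD (as of FID) is the Fréchet distance between two Gaussians fitted to the feature statistics; a self-contained numpy version of that step:

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def frechet_distance(mu1, sig1, mu2, sig2):
    """Fréchet distance between two Gaussians -- the core of FID/FVD.
    For FVD, mu/sig are the mean and covariance of I3D features
    computed over real vs. generated clips."""
    s1h = _sqrtm_psd(sig1)
    covmean = _sqrtm_psd(s1h @ sig2 @ s1h)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sig1) + np.trace(sig2)
                 - 2.0 * np.trace(covmean))

# Identical statistics give distance 0; a pure mean shift gives its squared norm.
mu = np.zeros(4)
sig = np.eye(4)
assert abs(frechet_distance(mu, sig, mu, sig)) < 1e-8
assert abs(frechet_distance(mu, sig, mu + 2.0, sig) - 16.0) < 1e-6
```

All the sensitivity issues mentioned above live upstream of this formula, in how the clips are sampled, resized and fed to the feature extractor, which is exactly what an evaluation protocol must pin down.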
Baselines. We use 5 baselines for comparison: MoCoGAN [MoCoGAN], MoCoGAN [MoCoGAN] with the StyleGAN2 [StyleGAN2] backbone, VideoGPT [VideoGPT], MoCoGAN-HD [MoCoGANHD] and DIGAN [DIGAN]. For MoCoGAN with the StyleGAN2 backbone (denoted as MoCoGAN+SG2), we replaced its generator and image-based discriminator with the corresponding StyleGAN2 components, leaving its video discriminator unchanged. We also used the training scheme and regularizations from StyleGAN2. MoCoGAN was trained for 5 days on a single GPU, since its lightweight DCGAN [DC_GAN] backbone makes it fast to train, while MoCoGAN+SG2 was trained for 2 days on several GPUs to reach 25M real images seen by its image-based discriminator. MoCoGAN-HD was trained for 4.5 days on V100 GPUs, as specified in the original paper (Appx B of [MoCoGANHD]). We trained VideoGPT for the maximum affordable total time of 32 GPU-days within our resource constraints. We trained DIGAN [DIGAN] for 5 days, since we observed that by that time the metrics either plateaued or exploded (for RainbowJelly). We also noticed that DIGAN uses weighted sampling during training by selecting clips from long videos with higher probabilities. This hurts its FVD score, and that is why we altered its data sampling strategy to the uniform one used by the other methods [MoCoGAN, MoCoGANHD, VideoGPT]. For each method, we selected the checkpoint with the best FVD value.

For the main evaluation, we train our method and all the baselines from scratch on the described datasets. Each model is trained on NVidia V100 32 GB GPUs, except for VideoGPT, which is very demanding in terms of GPU memory at this resolution, so we had to train it on NVidia A6000 GPUs instead (with an overall batch size of 4). For our method and MoCoGAN+SG2, we use exactly the same optimization scheme as StyleGAN2, including the loss function, the Adam optimizer hyperparameters and R1 regularization [R1_reg]. We reduce the learning rate by a factor of 10 for the video discriminator of MoCoGAN+SG2, since it does not have equalized learning rate [ProGAN]. We use the same hyperparameter values for all the experiments except for SkyTimelapse (see Appx B for the training details). We evaluate all the methods under the same evaluation protocol, described in Appx C, and report the results in Table 1.

To measure efficiency, we use the number of GPU-days required to train a method. We build on top of the official StyleGAN2 implementation (https://github.com/NVlabs/stylegan2-ada-pytorch). The training cost of the image-based StyleGAN2 to reach its specified 25M images is ≈8 NVidia V100 GPU-days in our environment. StyleGAN-V is trained for 2 days, which corresponds to 23M real frames seen by the discriminator. MoCoGAN-HD is built on top of the stylegan2-pytorch codebase (https://github.com/rosinality/stylegan2-pytorch), which is ≈2× slower than the highly optimized NVidia implementation. That is why in Table 1 we report its training cost reduced by a factor of 2 to account for this.

Method | FaceForensics (FVD_16 / FVD_128) | SkyTimelapse (FVD_16 / FVD_128)
Default StyleGAN-V | 47.41 / 89.34 | 79.52 / 197.0
w/o our generator | 65.88 / 41.77 | 109.1 / 240.2
w/o our discriminator | 154.0 / 139.1 | 236.9 / 258.0
w/o hyper-modulation in the discriminator | 88.8 / 161.5 | 69.8 / 184.1
w/o projection conditioning in the discriminator | 95.4 / 236.0 | 102.1 / 210.3
w/ LSTM motion codes, small δ | 131.9 / 159.1 | 135.7 / 196.1
w/ LSTM motion codes, large δ | 180.3 / 94.55 | 95.71 / 165.8
Number of frames per clip | FaceForensics (FVD_16 / FVD_128) | SkyTimelapse (FVD_16 / FVD_128)
k = 2 | 60.41 / 93.5 | 50.5 / 209.9
k = 3 (default) | 47.41 / 89.34 | 79.52 / 197.0
k = 4 | 51.84 / 114.9 | 65.7 / 194.5
k = 8 | 101.9 / 211.4 | 73.12 / 215.9
k = 16 | 92.52 / 192.8 | 107.6 / 254.3
Our method significantly outperforms the existing ones on almost all the benchmarks in terms of FVD_16 and FVD_128. On UCF101, all the methods perform poorly, which shows that it is too difficult a benchmark for modern video generators, and one needs to train models of extreme scale to fit it [DVD_GAN]. For completeness, the Inception Score [TGAN] results are provided in Table 5 in Appx C.
We visualize the samples in Fig 1 and Fig 7. Our method is able to generate hour-long plausible-looking videos, though their motion diversity and global motion coherence are limited (see Appx A). MoCoGAN-HD suffers from the LSTM instability when unrolled to large lengths and does not produce diverse motions. DIGAN produces high-quality videos on SkyTimelapse, because its inductive bias of joint spatio-temporal positional information is well suited for videos where the entire scene moves. But for FaceForensics, this leads to a "head flying away" effect (see Appx H). To generate 1-hour-long videos with MoCoGAN-HD, we unroll its LSTM model to the required depth and synthesize frames only at the necessary time positions, while DIGAN, similar to our method, is able to generate frames non-autoregressively.
To ablate the core components, we replaced the generator or the discriminator with its MoCoGAN+SG2 counterpart. In both cases, the replacement leads to poor short-term and long-term video quality, as shown by the corresponding metrics in Table 2 and the video samples in the supplementary.
Replacing our continuous motion codes with codes produced by an LSTM model hurts the performance, especially when the distance between motion codes is small. This happens due to unnaturally abrupt transitions between frames, and we provide the corresponding samples in the supplementary. The corresponding results are in Table 2.
Another architectural decision that we verify is the conditioning scheme utilized in the discriminator. We use both ModConv2d blocks and the projection conditioning head [ProjectionDiscriminator] in DiscrEpilogue to provide the conditioning signal about the time differences between frames. Removing either of them hurts the performance, because it constrains the discriminator's ability to understand the temporal scale it is currently operating on. We ablate the hypernetwork-based modulation by feeding a vector of zeros instead of the positional embeddings of the time differences into ModConv2d, so that only the bias vectors in the corresponding affine transforms participate in the modulation process.
An important design choice is how many frames per video one should use during training. We try different values of the number of sampled frames and report the corresponding results in Table 3. As discussed in Sec 3.3, for the existing video generation benchmarks it might be enough to sample only several frames per video, and our experiments confirm this observation. The performance decreases for larger frame counts, but this might be attributed to the weak temporal aggregation procedure of our discriminator, which simply concatenates features together. It is surprising to see that modern datasets can be fit with as few as 2 samples per video.
Our generator is able to produce arbitrarily long videos. Our design of the motion codes allows StyleGAN-V not to suffer from stability problems when unrolled to large (potentially infinite) video lengths. We verify this by visualizing video clips at extremely large timesteps in Fig 1 and Fig 8. We also demonstrate its ability to produce videos at an arbitrarily high frame rate in the supplementary.
Our model has the same latent space manipulation properties as StyleGAN2. To show this, we conduct two experiments: embedding, editing and animating an off-the-shelf image, and editing and animating the first frame of a generated video. To embed an image, we use an optimization procedure similar to [Image2StyleGAN], but consider the image to be positioned at t = 0. To edit an image with CLIP, we use the procedure of [StyleCLIP]. The results of these experiments are visualized in Fig 2; we provide the details in Appx B and more examples in the supplementary. Apart from showing the good properties of its latent space, these experiments demonstrate the extrapolation potential of our generator.
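Embedding an image amounts to optimizing a latent code so that the frozen generator reproduces the target frame at t = 0. With a linear stand-in generator, the procedure reduces to a few lines (everything here is illustrative, not the actual StyleGAN-V inversion code):

```python
import numpy as np

rng = np.random.default_rng(5)
G = 0.5 * rng.standard_normal((16, 4))        # frozen "generator": latent -> pixels
target = G @ np.array([1.0, -2.0, 0.5, 3.0])  # the image to embed (in G's range here)

w = np.zeros(4)                               # latent being optimized (frame at t = 0)
lr = 0.4 / np.linalg.norm(G.T @ G, 2)         # safe step size for this quadratic loss
for _ in range(2000):                         # gradient descent on ||G w - target||^2
    w -= lr * 2 * G.T @ (G @ w - target)
assert np.linalg.norm(G @ w - target) < 1e-3
```

In the real setting the generator is a deep network and the loss typically combines pixel and perceptual terms, but the structure is the same: the recovered latent can then be edited or driven forward in time by new motion codes.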
StyleGAN-V has almost the same training efficiency and image quality as StyleGAN2. In Fig 3, we plot the FID scores and training costs of modern video generators on FaceForensics against their corresponding FVD scores. Our method comes very close to StyleGAN2: it converges to an FID of 9.44 in 8 GPU-days, compared to an FID of 8.42 in 7.72 GPU-days for StyleGAN2, i.e. only about 10% worse. This raises the question of whether video generators can become as computationally efficient and as good in terms of image quality as image-based ones.
Our model is the first one which is directly trainable at $1024^2$ resolution. We provide the generations on MEAD for our method and for MoCoGAN-HD. MoCoGAN-HD cannot preserve the identity of a speaker and diverges for large video lengths, while our method achieves comparable image quality and coherent motions. For this dataset, our model was trained for 7 days on NVidia V100 GPUs and obtained an FID of 24.12 and an FVD of 156.1. The image generator of MoCoGAN-HD was trained on A6000 GPUs, while its video generator required less training time since it didn't need high-resolution training.
Our discriminator provides a more informative learning signal to the generator. Fig 4 visualizes the gradient signal to the generator from our discriminator and from the conv3d-based video discriminator of MoCoGAN-HD, measured for our method at 10M images seen by the discriminator and for MoCoGAN-HD at the 300th epoch. In our case, one can easily see fine-grained details of the face structure perceived by the discriminator, while in the case of MoCoGAN-HD, most of the gradient is redundant and lacks any structural information.

Content and motion decomposition. Similar to MoCoGAN [MoCoGAN], our generator captures content and motion variations in a disentangled manner: altering motion codes while fixing the content code does not change the appearance variations (like a speaker's identity). Similarly, resampling the content code does not influence the motion patterns of a video, but only its content. We provide the corresponding visualizations on the project website.
In this work, we provided a different perspective on time for video synthesis and built a continuous video generator using the paradigm of neural representations. For this, we developed motion representations through the lens of positional embeddings, explored sparse training of video generators and redesigned the typical dual structure of a video discriminator. Our model is built on top of StyleGAN2 and inherits many of its perks, like efficient training, good image quality and an editable latent space.
We hope that our work will serve as a solid basis for building more powerful video generators in the future. The limitations and potential negative impact are discussed in Appx A.
Our model has the following limitations:
Limitations of sparse training. In general, sparse training makes it impossible for the discriminator to capture complex dependencies between frames. But surprisingly, it provides state-of-the-art results on modern datasets, which (using the statement from Sec 3.3) implies that they are not that sophisticated in terms of motion.
Dataset-induced limitations. Similar to other machine learning models, our method is bound by the quality of the dataset it is trained on. For example, for the FaceForensics dataset [FaceForensics_dataset], our embedding and manipulation results are inferior to the StyleGAN2 ones [Image2StyleGAN]. This is due to the limited number of identities (just 700) in FaceForensics and their larger diversity in terms of quality compared to FFHQ [StyleGAN], which StyleGAN2 was trained on.

Periodicity artifacts. The generator still produces periodic motions sometimes, despite our acyclic positional embeddings. Future investigation of this phenomenon is needed.
Poor handling of newly appearing content. We noticed that our generator tries to reuse the content information encoded in the global latent code as much as possible. This is noticeable on datasets where new content appears during a video, like SkyTimelapse or RainbowJelly. We believe it can be resolved using ideas similar to ALIS [ALIS].
Sensitivity to hyperparameters. We found our generator to be sensitive to the minimal initial period length (see Appx B). We increased it for SkyTimelapse [SkyTimelapse_dataset] from 16 to 256: otherwise, the generated videos contained unnatural sharp transitions.
We plan to address those limitations in future work.
The potential negative impact of our method is similar to that of traditional image-based GANs: creating "deepfakes" and using them for malicious purposes.^{7}https://en.wikipedia.org/wiki/Deepfake Our model makes it much easier to train a generator which produces realistic video samples with a small amount of computational resources. But since high-quality datasets for video synthesis are scarce, the resulting model will fall short compared to its image-based counterpart, which can use rich, high-quality image datasets for training, like FFHQ [StyleGAN].
Note that all the details can be found in the source code: https://github.com/universome/styleganv.
Our model is built on top of the official StyleGAN2-ADA [StyleGAN2ADA] repository.^{8}https://github.com/nvlabs/stylegan2ada Since we build a model to generate continuous videos, a reasonable question is why we did not use INR-GAN [INRGAN] instead (like DIGAN [DIGAN] does) to obtain fully continuous signals. The reason we chose StyleGAN2 over INR-GAN is that StyleGAN2 is amenable to mixed-precision training, which makes it considerably faster to train. For INR-GAN, enabling mixed precision severely decreases the quality, and we hypothesize the reason is that each pixel in INR-GAN's activations tensor carries more information (due to spatial independence), since the model cannot distribute information spatially anymore. Explicitly restricting the range of possible values adds a strict upper bound on the amount of information each pixel is able to carry. We also found that adding coordinate information does not improve video quality for our generator, either qualitatively or in terms of scores.
Similar to StyleGAN2, we utilize the non-saturating loss and R1 regularization with a loss coefficient of 0.2 in all the experiments; this value is inherited from the original repo and we didn't run any hyperparameter search for it. We also use an fmaps parameter of 0.5 (the original StyleGAN2 used 1.0), which controls the channel dimensionalities in the generator and discriminator, since it is the default setting of StyleGAN2-ADA for $256^2$ resolution. This allowed us to further speed up training.
The dimensionalities of the latent codes are all set to 512.
As stated in the main text, we use a padding-less conv1d-based motion mapping network with a large kernel size to generate raw motion codes. In all the experiments, we use a kernel size of 11 and a stride of 1. We do not use any dilation, despite the fact that it could increase the temporal receptive field: we found that varying the kernel size didn't produce much benefit in terms of video quality. Using padding-less convolutions allows the model to be stable when unrolled at large depths. We use 2 layers of such convolutions with a hidden size of 512. Another benefit of conv1d-based blocks is that, in contrast to LSTM/GRU cells, one can practically incorporate the equalized learning rate scheme [ProGAN] into them.

Using a padding-less conv1d-based motion mapping network forces us to use "previous" motion noise codes. That's why, instead of sampling a sequence of exactly the required length, we sample a slightly larger one to adjust for the reduced sequence size. With a same-padding strategy, sampling a frame at a given position would require only the motion noise codes around that position. But with our kernel size of 11, 2 layers of convolutions and no padding, the input sequence must be longer than the output by $2 \times (11 - 1) = 20$ codes.
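The sequence-length bookkeeping above can be checked with a minimal numpy sketch (an averaging kernel stands in for the learned conv1d weights; the stride-1, kernel-11, 2-layer setup follows the text):

```python
import numpy as np

def conv1d_valid(x, kernel_size=11):
    """Padding-less ("valid") 1-D convolution along the last (time) axis.

    Sketch only: a fixed averaging kernel replaces the learned weights.
    The output is shorter than the input by kernel_size - 1 steps.
    """
    w = np.ones(kernel_size) / kernel_size
    T = x.shape[-1] - kernel_size + 1
    return np.stack([x[..., t:t + kernel_size] @ w for t in range(T)], axis=-1)

# Two padding-less layers with kernel size 11 shrink a sequence of
# length T to T - 2*(11-1): to obtain motion codes for 32 timesteps,
# one must sample 32 + 20 = 52 "previous" noise codes.
z = np.random.randn(512, 52)   # hypothetical: 512-dim noise, 52 steps
h = conv1d_valid(conv1d_valid(z))
print(h.shape)                 # -> (512, 32)
```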
The training performance of VideoGPT on UCF101 is surprisingly low despite the fact that it was developed for this kind of dataset [VideoGPT]. We hypothesize that this happens because UCF101 is a very difficult dataset and VideoGPT was trained with a batch size of 4 (a higher batch size didn't fit our 200 GB GPU memory setup), which damaged its ability to learn the distribution.
To train our model, we also utilized the adaptive differentiable augmentations of StyleGAN2-ADA [StyleGAN2ADA], but we found it important to make them video-consistent, i.e. to apply the same augmentation to each frame of a video. Otherwise, the discriminator starts to underperform and the overall quality decreases. We use the default bgc augmentation pipeline from StyleGAN2-ADA, which includes horizontal flips, 90-degree rotations, scaling, horizontal/vertical translations, hue/saturation/brightness/contrast changes and luma axis flipping.
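A minimal sketch of video-consistent augmentation, assuming a (T, H, W, C) clip in [0, 1] and using just two stand-in transforms (the real bgc pipeline is richer):

```python
import numpy as np

def augment_video_consistently(video, rng):
    """Apply the SAME randomly drawn augmentation to every frame of a clip.

    Sketch: the augmentation parameters (here, a horizontal flip and a
    brightness shift of assumed magnitude) are sampled once per video,
    not per frame, so the discriminator sees temporally consistent inputs.
    video: float array of shape (T, H, W, C) with values in [0, 1].
    """
    flip = rng.random() < 0.5
    brightness = rng.uniform(-0.2, 0.2)    # hypothetical magnitude
    if flip:
        video = video[:, :, ::-1, :]       # same flip for all T frames
    return np.clip(video + brightness, 0.0, 1.0)
```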
While training the model, for real videos we first select a video index and then select a random clip (i.e., a clip with a random offset). This differs from the traditional DIGAN or VideoGPT training scheme, which is why we needed to change their data loaders to make them learn the same statistics and not get biased by very long videos.
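The clip-sampling scheme above can be sketched as follows (hypothetical helper; `video_lengths` holds the frame count of each real video):

```python
import random

def sample_real_clip(video_lengths, clip_len=16):
    """First pick a video index uniformly, then a random temporal offset.

    Sketch of the sampling described above: because the video index is
    drawn first, very long videos do not dominate the training statistics
    (unlike samplers drawing from the pooled set of all possible clips).
    """
    idx = random.randrange(len(video_lengths))
    offset = random.randrange(video_lengths[idx] - clip_len + 1)
    return idx, offset
```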
To develop this project, NVidia V100 32GB GPU-years + NVidia A6000 GPU-years were spent.
In this subsection, we describe the embedding and editing procedures which were used to obtain the results in Fig 2.
Projection. To project an existing photograph into the latent space of the generator, we used the procedure from StyleGAN2 [StyleGAN2], but projecting into the $\mathcal{W}+$ space [Image2StyleGAN] instead of $\mathcal{W}$, since it produces better reconstruction results and does not spoil the editing properties. We optimized the code with the LPIPS reconstruction loss [LPIPS] for 1000 steps using Adam. For the motion codes, we initialized a static sequence and kept it fixed during the optimization process. We noticed that when it is also optimized, the reconstruction becomes almost perfect, but it breaks when another sequence of motion codes is plugged in.
Editing. Our CLIP editing procedure is very similar to the one in StyleCLIP [StyleCLIP], with the exception that we embed an image assuming that it is a video frame at the initial timestep. On each iteration, we resample the motion codes, since all our edits are semantic and do not refer to motion. We leave motion editing with CLIP for future exploration. For the sky editing video presented in Fig 2, we additionally utilize masking: we initialize a mask to cover the trees and try not to change them during the optimization using the LPIPS loss. For all the videos presented on the supplementary website, no masking is used.
The details can be found in the provided source code.
Mitigating high-frequency artifacts. We noticed that if our periods are left unbounded, they might grow to very large values, which corresponds to extra-high frequencies (the period length becomes less than 4 frames) and leads to temporal aliasing. That's why we process them via a bounding transform: this constrains them to a fixed range with a mean of 1.0, i.e. preserving the at-initialization frequency scaling, which we discuss next.
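Since the exact bounding transform is not spelled out in this text, the following is only an assumed sigmoid-based variant with the stated properties (bounded range, value 1.0 for a zero raw input, i.e. at initialization):

```python
import numpy as np

def bound_periods(raw, lo=0.5, hi=1.5):
    """Squash unbounded raw period multipliers into (lo, hi), centred at 1.0.

    Assumption: the paper only states that periods are bounded with a mean
    of 1.0; here sigmoid maps R -> (0, 1), which is rescaled to (lo, hi),
    so raw = 0 gives exactly (lo + hi) / 2 = 1.0 and extreme raw values
    can never produce the sub-4-frame periods that cause temporal aliasing.
    """
    return lo + (hi - lo) / (1.0 + np.exp(-raw))
```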
Linearly spaced periods. An important design decision is the scaling of periods, since at initialization it should cover both high-frequency and low-frequency details. Existing works use either exponential scaling (e.g., [NeRF, Nerfies, ALIS, SAPE]) or random scaling (e.g., [SIREN, FourierFeatures, INRGAN, CIPS]). In practice, we scale the i-th column of the amplitudes weight matrix with the value:
(6)
where the minimal and maximal period lengths (in frames) are fixed in all the experiments, except for SkyTimelapse, for which we use a larger minimal period length. We call this scheme linear scaling and use it as an additional tool to alleviate periodicity, since it greatly increases the overall cycle of a positional embedding (see Fig 9). See also the accompanying source code for details.
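A sketch of the linear-scaling idea (the endpoint values `t_min`/`t_max` and the exact column-scaling formula of Eq. (6) are assumptions here, not taken from the text):

```python
import numpy as np

def linear_period_frequencies(dim, t_min=16.0, t_max=1024.0):
    """Angular frequencies for linearly spaced period lengths.

    Sketch of "linear scaling": period lengths are spaced linearly
    between t_min and t_max frames, so the overall cycle of the
    positional embedding (roughly the least common multiple of all
    periods) grows much faster than with exponential spacing.
    """
    periods = np.linspace(t_min, t_max, dim)  # period length per column
    return 2.0 * np.pi / periods              # angular frequency per column
```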
Another benefit of using our positional embeddings over an LSTM is that they are "always stable", i.e. they always stay in a suitable range.
For the practical implementation, see the provided source code: https://github.com/universome/styleganv.
In this section, we describe the difficulties of a fair FVD computation. There are discrepancies between papers even in computing FID [FID_evaluation]. So it is not surprising that FVD computation for videos diverges even more and has more serious implications for method evaluation.
First, we note that the I3D model [i3d] has different weights on tf.hub (https://tfhub.dev/deepmind/i3dkinetics400/1 — the model which is used in the official FVD repo^{9}https://github.com/googleresearch/googleresearch/blob/master/frechet_video_distance) compared to its official release in the official github repo.^{10}https://github.com/deepmind/kineticsi3d That's why we manually exported the weights from tf.hub and used this github repo^{11}https://github.com/hassony2/kinetics_i3d_pytorch to obtain an exact implementation in PyTorch.
There are several issues with the FVD metric on its own. First, it does not capture motion collapse, which can be observed by comparing the FVD scores of StyleGAN-V and of StyleGAN-V with LSTM motion codes instead of ours: the latter has a severe motion collapse issue (see the samples on our website), yet has similar or lower FVD scores compared to our model: 196.1 or 165.8 (depending on the distance between anchors) vs 197.0 for our model. Another issue with FVD is that it is biased towards image quality. If one trains a good image generator which is not able to generate any plausible motion at all, FVD will still be good for it despite the degenerate motion.
We also want to make a note on how we compute FID for video generators. For this, we generate 2048 videos of 16 frames each (starting from the first frame) and use all those frames in the FID computation. This gives ~33k images to construct the dataset, but those images have lower diversity compared to the typically utilized 50k-sized set of images from a traditional image generator [StyleGAN], because the 16 images in a single clip likely share a lot of content. A better strategy would be to generate 50k videos and pick a random frame from each video, but this is too heavy computationally for models which produce frames autoregressively. And using just the first frame in the FID computation would unfairly favour MoCoGAN-HD, which generates the very first frame of each video with a frozen StyleGAN2 model.
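For concreteness, the protocol above amounts to enumerating every frame of every generated clip (2048 × 16 = 32768 ≈ 33k images); a trivial sketch with hypothetical naming:

```python
def collect_fid_frames(num_videos=2048, frames_per_video=16):
    """Enumerate (video, frame) index pairs used for video-generator FID.

    Sketch of the protocol above: every frame of every generated 16-frame
    clip enters the FID set, giving 2048 * 16 = 32768 images, at the cost
    of lower diversity than 50k independently generated images.
    """
    return [(v, t) for v in range(num_videos) for t in range(frames_per_video)]
```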
FVD is greatly influenced by 1) how many clips per video are selected; 2) with which offsets; and 3) at which framerate. For example, SkyTimelapse contains several extremely long videos: if we select as many clips as possible from each real video, then they will severely bias the statistics of FVD. For FaceForensics, videos often contain intro frames during their first 0.5-1.0 seconds, which will affect FVD when a constant offset is chosen to extract a single clip per video.
That’s why we use the following protocol to compute FVD.
Computing real statistics. To compute the real statistics, we select a single clip per video, chosen at a random offset. We use the actual framerate of the dataset the model is being trained on, without skipping any frames. The problem with such an approach is that datasets with a small number of long videos (like FaceForensics, see Table 7) might have noisy estimates. But our results showed that the standard deviations are low even for FaceForensics. The largest standard deviation we observed was when computing FVD on RainbowJelly: on this dataset it occurred for VideoGPT, but it is small relative to the overall magnitude of the score.

Computing fake statistics. To compute the fake statistics, we generate 2048 videos and save them as frames in JPEG format via the Pillow library. We use a high quality parameter for doing this, since it was shown to produce quality very close to PNG without introducing artifacts that would lead to discrepancies [FID_evaluation]. Ideally, one would like to store frames in the PNG format, but in this case it would be too expensive to represent video datasets: for example, MEAD would occupy terabytes of space.
| Method | FF | ST |
| Proper computation | 76.82 | 61.95 |
| When resized to a lower resolution | 38.92 | 59.86 |
| With jpg/png discrepancy | 80.17 | 71.40 |
| When using all clips per video | 84.59 | 72.03 |
| When using only first frames | 91.64 | 59.74 |
| When using frame subsampling | 82.88 | 90.21 |
| Still real images | 342.5 | 166.8 |
We illustrate the subtleties of FVD computation in Table 4. For this, we compute the real/fake statistics for our model in several different ways:
Resized to a lower resolution. Both fake and real images are resized to a lower resolution via PyTorch bilinear interpolation (without corner alignment) before computing FVD.
JPG/PNG discrepancy. Instead of saving fake frames in JPG with a high quality parameter, we use a lower quality value in the PIL library. This creates more JPEG-like artifacts, which, for example, FID is very sensitive to.
Using all clips per video. We use all available fixed-length clips in each video without overlaps. Note that our model was trained using a single randomly offset clip per video.
Using only first frames. In each real video, instead of using random offsets to select clips, we use the first frames.
Using subsampling. When sampling frames to compute the real/fake statistics, we select every n-th frame. This is the strategy employed for some of the experiments in the original paper [FVD] — but in their case, the authors trained the model on videos with this subsampling.
For completeness, we also provide the Inception Score [TGAN] on the UCF101 dataset in Table 5. Note that it is computed by resizing all videos to C3D's small spatial resolution (due to the internal structure of the C3D [c3d] model), which makes it impossible to capture high-resolution details of the generated videos, which are the focus of the current work.
| Method | Inception Score [TGAN] |
| MoCoGAN [MoCoGAN] | 10.09 ± 0.30 |
| MoCoGAN+SG2 (ours) | 15.26 ± 0.95 |
| VideoGPT [VideoGPT] | 12.61 ± 0.33 |
| MoCoGAN-HD [MoCoGANHD] | 23.39 ± 1.48 |
| DIGAN [DIGAN] | 23.16 ± 1.13 |
| StyleGAN-V (ours) | 23.94 ± 0.73 |
| Real videos | 97.23 ± 0.38 |
In Tab 6, we provide the numbers used in Fig 3. Note that StyleGAN2 training in our case is slightly slower than the officially specified one (7.3 vs 7.72 GPU-days),^{12}https://github.com/NVlabs/stylegan2adapytorch which we attribute to a slightly slower file system on our computational cluster.
| Method | FVD | FID | Training cost (GPU-days) |
| MoCoGAN [MoCoGAN] | 124.7 | 23.97 | 5 |
| MoCoGAN+SG2 (ours) | 55.62 | 10.82 | 8 |
| VideoGPT [VideoGPT] | 185.9 | 22.7 | 32 |
| MoCoGAN-HD [MoCoGANHD] | 111.8 | 7.12 | 16.5 |
| DIGAN [DIGAN] | 62.5 | 19.1 | 16 |
| StyleGAN-V (ours) | 47.41 | 9.445 | 8 |
| StyleGAN2 [StyleGAN2ADA] | N/A | 8.42 | 7.72 |
In this section, we provide a list of ideas which we tried but which didn't work out, either because the idea itself is not good or because we didn't put enough experimental effort into investigating it.
Hierarchical motion codes. We tried having several layers of motion codes, where each layer has its own distance between the codes. In this way, high-level codes should capture high-level motion and bottom-level codes should represent short local motion patterns. This didn't improve the scores and didn't provide any disentanglement of motion information. We believe that motion should be represented differently (similar to FOMM [FOMM]) rather than with motion codes, because motion codes make it difficult for the generator to keep them temporally coherent.
Maximizing the entropy of motion codes to alleviate motion collapse. As an additional tool to alleviate motion collapse, we tried to maximize the entropy of the wave parameters of our motion codes. The generator solved the task of maximizing the entropy well, but it didn't affect the motion collapse: it managed to reserve some dimensions specifically to synchronize motions.
Progressive growing of frequencies in positional embeddings. We tried starting with low frequencies first and progressively unlocking new ones during training. It is a popular strategy for training implicit neural representations on reconstruction tasks (e.g., [Nerfies, SAPE]), but in our case we found the following problem with it: the generator learned to use low frequencies for representing high-frequency motion and didn't learn to utilize high frequencies for this task when they became available. As a result, high-frequency motion patterns (like blinking or speaking) were unnaturally slow.
Continuous LSTM with EMA states. Our motion codes use sine/cosine activations, which makes them suffer from periodic artifacts (those artifacts are mitigated by our parametrization, but still present sometimes). We tried to use an LSTM, but with an exponential moving average on top of its hidden states to smooth out the motion representations temporally. However (likely due to the limited experimental effort we invested in this direction), the resulting motion representations were either too smooth or too sharp (depending on the EMA window size), which resulted in unnatural motions.
Concatenating spatial coordinates. INR-GAN [INRGAN] uses spatial positional embeddings and shows that they provide a better geometric prior to the model. We tried to use them in our experiments as well, but they didn't provide any improvement, either qualitatively or quantitatively, and made the training slightly slower (by 10%) due to the increased channel dimensionalities.
Feature differences in the discriminator. Another direction we tried is computing the differences between activations of next/previous frames in a video and concatenating this information back to the activations tensor. The intuition was to provide the discriminator with some sort of "latent" optical flow information. However, it made the discriminator too powerful (its loss became smaller than usual) and it started to outpace the generator too much, which decreased the final scores.
Predicting time instead of conditioning on it in the discriminator. There are two ways to utilize the time information in the discriminator: as a conditioning signal or as a learning signal. For the latter, we tried to predict the time distances between frames by training an additional head to predict the corresponding class (we treated the problem as classification instead of regression, since there is a very limited number of time distances between frames which the discriminator sees during training). However, it noticeably decreased the scores.
Conditioning on video length. For unconditional UCF101, it might be very important for the generator to know the video length in advance, because some classes contain very short clips (like jumping) while others are very long, and it might be useful for the generator to know in advance which video it will need to generate (since we sample frames at random time locations during training). However, utilizing this conditioning didn't influence the scores.
We provide the dataset statistics in Fig 10 and their comparison in Table 7. Note that for MEAD, we use only its front camera shots (originally, it provides shots from several camera positions).
| Dataset | #hours | avg len | FPS | #speakers |
| FaceForensics [FaceForensics_dataset] | 4.04 | 20.7s | 25 | 700 |
| SkyTimelapse [SkyTimelapse_dataset] | 12.99 | 22.1s | 25 | N/A |
| UCF101 [UCF101_dataset] | 0.51 | 6.8s | 25 | N/A |
| RainbowJelly | 7.99 | 17.1s | 30 | N/A |
| MEAD [MEAD_dataset] | 36.11 | 4.3s | 30 | |
For our RainbowJelly benchmark, we used the following film: https://www.youtube.com/watch?v=P8Bit37hlsQ. It is an 8-hour-long movie of jellyfish in 4K resolution at 30 FPS from the Hoccori Japan youtube channel. We cannot release this dataset due to copyright restrictions, but we released a full script which processes it (see the provided source code). To construct the benchmark, we sliced the film into 1686 chunks of 512 frames each, starting from the 150th frame (to remove the loading screen), then center-cropped and resized them to the target resolution. This benchmark is advantageous compared to the existing ones in the following ways:
It contains complex hierarchical motions:
a jellyfish flowing in a particular direction (low-frequency global motion);
a jellyfish pushing water with its arms (medium-frequency motion);
small perturbations of a jellyfish's body and tentacles (high-frequency local motion).
It is a very high-quality dataset (4K resolution).
It is simple in terms of content, which makes the benchmark more focused on motions.
It contains long videos.
In this section, we elaborate on our simple theoretical exposition from Sec 3.3.
Consider that we want to fit a probabilistic model $q(\mathbf{x})$ to the real data distribution $p(\mathbf{x})$. For simplicity, we will be considering a discrete finite case, i.e. $\mathbf{x} = (x_1, ..., x_v) \in \mathcal{X}^v$ for a finite alphabet $\mathcal{X}$, but note that videos, while continuous and infinite in theory, are still discretized and have a time limit to fit on a computer in practice. For fitting the distribution, we use sparse training, i.e. picking only $k$ random coordinates from each sample during the optimization process. In other words, introducing sparse sampling reformulates the problem from
$$\min_q \; D\big(p(\mathbf{x}), q(\mathbf{x})\big) \qquad (7)$$
into
$$\min_q \; \sum_{I \in \mathcal{C}_k} D\big(p(\mathbf{x}_I), q(\mathbf{x}_I)\big) \qquad (8)$$
where $D$ is a problem-specific distance function between probability distributions, $\mathcal{C}_k$ is a collection of all possible sets of $k$ unique indices and $\mathbf{x}_I$ denotes a subvector of $\mathbf{x}$. This means that instead of bridging together the full distributions we choose to bridge all their possible marginals of length $k$ instead. When will solving Eq. (8) help us to obtain the full joint distribution $p(\mathbf{x})$? To investigate this question, we develop the following simple statement.
Let's denote by $\mathcal{H} = \{H_i\}_{i=1}^{v}$ a collection of sets of up to $k-1$ indices s.t. we have $H_i \subseteq \{1, ..., i-1\}$ and $|H_i| \le k-1$ for all $i$.
Using the chain rule, we can represent $p(\mathbf{x})$ as:
$$p(\mathbf{x}) = \prod_{i=1}^{v} p(x_i \mid \mathbf{x}_{<i}) \qquad (9)$$
where $\mathbf{x}_{<i}$ denotes the sequence $(x_1, ..., x_{i-1})$. Now, if we know that for each $i$, there exists $H_i$ with $|H_i| \le k-1$ s.t.:
$$p(x_i \mid \mathbf{x}_{<i}) = p(x_i \mid \mathbf{x}_{H_i}) \qquad (10)$$
then $p(\mathbf{x})$ is obviously simplified to:
$$p(\mathbf{x}) = \prod_{i=1}^{v} p(x_i \mid \mathbf{x}_{H_i}) \qquad (11)$$
Does this tell us anything useful? Surprisingly, yes. It says that if $p(\mathbf{x})$ is simple enough that instead of using the whole history $\mathbf{x}_{<i}$ to model $p(x_i \mid \mathbf{x}_{<i})$ it's enough to use only some set of "representative moments" $H_i$ (unique for each $i$) with size $|H_i| \le k-1$, then sparse training is a viable alternative. After fitting $q$ via sparse training, we will be able to obtain $p(\mathbf{x})$ using Eq (10) even though $k \ll v$! Note that one can obtain a conditional distribution from the marginal one for some set of indices $I$ via:
$$p(x_i \mid \mathbf{x}_I) = \frac{p(x_i, \mathbf{x}_I)}{p(\mathbf{x}_I)} \qquad (12)$$
But we would also like to have the "reverse" dependency, i.e. knowing that if we can approximate the distribution via a set of marginals, then this distribution is not too difficult. For this claim, we will need to consider marginals not of an arbitrary form $p(\mathbf{x}_I)$, but of the form $p(x_i, \mathbf{x}_{H_i})$, and we would need exactly $v$ of those. The reverse implication is the following: if $p(\mathbf{x})$ can be represented as a product of conditionals $\prod_{i=1}^{v} p(x_i \mid \mathbf{x}_{H_i})$, then for each $i$ there exists $H_i$ s.t. $p(x_i \mid \mathbf{x}_{<i}) = p(x_i \mid \mathbf{x}_{H_i})$.
This statement, just like the previous one, looks obvious, but oddly requires more than a single sentence to prove. First, we are given that:
$$p(\mathbf{x}) = \prod_{i=1}^{v} p(x_i \mid \mathbf{x}_{H_i}) \qquad (13)$$
but unfortunately, we cannot directly claim that each term in this product equals its corresponding one in the product $\prod_{i=1}^{v} p(x_i \mid \mathbf{x}_{<i})$. For this, we first need to show that for each $i$ we have:
$$p(\mathbf{x}_{\le i}) = \prod_{j=1}^{i} p(x_j \mid \mathbf{x}_{H_j}) \qquad (14)$$
It can be seen from the fact that:
$$p(\mathbf{x}_{\le i}) = \sum_{x_{i+1}, ..., x_v} p(\mathbf{x}) = \prod_{j=1}^{i} p(x_j \mid \mathbf{x}_{H_j}) \cdot \sum_{x_{i+1}, ..., x_v} \prod_{j=i+1}^{v} p(x_j \mid \mathbf{x}_{H_j}) \qquad (15)$$
where, since each $H_j \subseteq \{1, ..., j-1\}$, the trailing conditionals sum to one. This allows to cancel the terms in the chain rule one by one, starting from the end, leading to the desired equality:
$$p(x_i \mid \mathbf{x}_{<i}) = \frac{p(\mathbf{x}_{\le i})}{p(\mathbf{x}_{<i})} = p(x_i \mid \mathbf{x}_{H_i}) \qquad (16)$$
Does this reverse claim tell us anything useful? Surprisingly again, yes. It implies that if we managed to fit $p(\mathbf{x})$ by using sparse training, then this distribution is not sophisticated.
Merging the above two statements together, we see that $p(\mathbf{x})$ can be represented as a product of conditionals $\prod_{i=1}^{v} p(x_i \mid \mathbf{x}_{H_i})$ with $|H_i| \le k-1$ if and only if for all $i$ there exists $H_i$ s.t. $p(x_i \mid \mathbf{x}_{<i}) = p(x_i \mid \mathbf{x}_{H_i})$.
What does this statement tell us about video synthesis? Any video synthesis algorithm utilizes sparse training to learn its underlying model, but in contrast to prior work, we use very small values of $k$. This means that we fit our model to match the length-$k$ marginals of $p(\mathbf{x})$ (considering that we pick frames uniformly at random) instead of the full joint. And using the above statement, such a setup implies the assumption of Eq (10). This equation says that one can know everything about $p(x_i \mid \mathbf{x}_{<i})$ by observing just the previous frames $\mathbf{x}_{H_i}$. In other words, $x_i$ must be predictable from $\mathbf{x}_{H_i}$. Moreover, it is easy to show that our statement can be generalized to include several $H_i$ for the $i$-th frame, i.e. there might exist several explaining sets of frames.
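The forward direction of this statement can be checked numerically. A minimal example (our own illustration, not from the paper): a first-order Markov chain, i.e. the case where each frame depends only on its immediate predecessor, so length-2 marginals seen by sparse training with 2 sampled frames suffice to reconstruct the full joint:

```python
import numpy as np

# Numeric check: for a first-order Markov chain, all length-2 marginals
# p(x_{i-1}, x_i) recover the full joint p(x_1, ..., x_v).
rng = np.random.default_rng(0)
v, S = 4, 3                               # 4 frames, 3 states each

p1 = rng.dirichlet(np.ones(S))            # p(x_1)
T = rng.dirichlet(np.ones(S), size=S)     # p(x_i | x_{i-1}), rows sum to 1

# Full joint via the chain rule.
joint = np.zeros((S,) * v)
for idx in np.ndindex(*joint.shape):
    pr = p1[idx[0]]
    for i in range(1, v):
        pr *= T[idx[i - 1], idx[i]]
    joint[idx] = pr

# Length-2 marginals p(x_{i-1}, x_i): all that 2-frame sparse training sees.
marg = []
m = p1
for i in range(1, v):
    pair = m[:, None] * T                 # pair[a, b] = p(x_{i-1}=a, x_i=b)
    marg.append(pair)
    m = pair.sum(axis=0)                  # next marginal p(x_i)

# Reconstruct the joint from marginals only, via
# p(x_i | x_{i-1}) = p(x_{i-1}, x_i) / p(x_{i-1}).
recon = np.zeros_like(joint)
for idx in np.ndindex(*joint.shape):
    pr = marg[0].sum(axis=1)[idx[0]]      # p(x_1) from the first marginal
    for i in range(1, v):
        pair = marg[i - 1]
        pr *= pair[idx[i - 1], idx[i]] / pair.sum(axis=1)[idx[i - 1]]
    recon[idx] = pr

print(np.abs(joint - recon).max())        # ~0: marginals recover the joint
```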
For ease of visualization, we provide additional samples of the model on a web page: https://universome.github.io/styleganv.
Our model shares a lot of similarities with DIGAN [DIGAN], and in this section we highlight those similarities and differences.
Sparse training. DIGAN also utilizes very sparse training (only 2 frames per video). But in our case, we additionally explore the optimal number of frames per video (see Sec 3.3).
Continuous-time generator. DIGAN also builds a generator which is continuous in time. But our generator does not lose quality at infinitely large lengths.
Dropping conv3d blocks. DIGAN also drops conv3d blocks in their discriminator. But in contrast to us, they still have 2 discriminators.
Motion representation. DIGAN uses only a single global motion code, which makes it theoretically impossible to generate infinite videos: at some point, it will start repeating itself (due to the usage of sine/cosine-based positional embeddings). In our case, we use an infinite sequence of motion codes, which are temporally interpolated, converted into wave parameters and transformed into motion codes. DIGAN also mixes temporal and spatial information together in the same positional embedding, which creates the following problem: even when only time changes, the spatial location perceived by the model also changes. This creates a "head-flying-away" effect (see the samples). In our case, we keep these two information sources decomposed from one another.
Generator's backbone. DIGAN is built on top of INR-GAN [INRGAN], while our work uses StyleGAN2. This allows DIGAN to inherit INR-GAN's benefit of being spatially continuous, but at the expense of being less stable and slower to train (due to the lack of mixed precision and the increased channel dimensionalities from concatenating positional embeddings).
Discriminator structure. DIGAN uses two discriminators: the first one operates on the image level and is equivalent to StyleGAN2's, while the other one operates on the "video" level: it takes a pair of frames and the time difference between them, concatenates them into a 7-channel input image (tiling the time difference scalar) and passes it into a model with the StyleGAN2 discriminator's backbone. In our case, we utilize a single hypernetwork-based discriminator.
Sampling procedure. We use a different number of samples per video than DIGAN. Also, we sample frames uniformly at random, while DIGAN selects its frame locations in a correlated manner (so DIGAN sometimes samples coinciding frames).
Apart from those major distinctions, there are a lot of small implementation differences. We refer an interested reader to the released codebases for them:
StyleGANV: https://github.com/universome/styleganv
INR-GAN demonstrated that it has a higher throughput than StyleGAN2 in terms of images/second [INRGAN]. But the authors compare to the original StyleGAN2 implementation and not to the one from the StyleGAN2-ADA repo, which is much better optimized. Also, they use caching of positional embeddings, which is only possible at test time and has a great influence on the computational performance. In this way, we found that StyleGAN2 is considerably faster to train and less consuming in terms of GPU memory than INR-GAN.
DIGAN is based on top of INR-GAN and that's why it suffers from the issues described above. We trained it for a week on NVidia V100 GPUs and observed that it stopped improving after a few days of training. The number of real frames seen by the discriminator over this period is matched by MoCoGAN+SG2 and StyleGAN-V in just 2 days at the same resolution in the same environment. At the time of submitting the main paper, there was no information about DIGAN's training cost. However, the authors updated their manuscript by the time of the supplementary submission and specify a training cost of 8 GPU-days, which is consistent with our experiments (considering that we use twice as large a resolution).