Source code for StyleGAN-V
Videos show continuous events, yet most - if not all - video synthesis frameworks treat them discretely in time. In this work, we think of videos as what they should be - time-continuous signals - and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image and video discriminator pair and propose to use a single hypernetwork-based one. This decreases the training cost and provides a richer learning signal to the generator, making it possible to train directly on 1024^2 videos for the first time. We build our model on top of StyleGAN2, and it is only marginally more expensive to train while achieving almost the same image quality. Moreover, our latent space features similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at arbitrarily high frame rates, while prior work struggles to generate even 64 frames at a fixed rate. Our model achieves state-of-the-art results on four modern 256^2 video synthesis benchmarks and one 1024^2 resolution one. Videos and the source code are available at the project website: https://universome.github.io/stylegan-v
Recent advances in deep learning have pushed image generation to unprecedented photo-realistic quality [StyleGAN2-ADA, BigGAN] and spawned many industry applications. Video generation, however, does not enjoy similar success and struggles to fit complex real-world datasets. The difficulties are caused not only by the more complex nature of the underlying data distribution, but also by the computationally intensive video representations employed by modern generators. They treat videos as discrete sequences of images, which is very demanding for representing long high-resolution videos and induces the use of expensive conv3d-based architectures to model them [TGAN, MoCoGAN, TGANv2, DVD_GAN]. (E.g., DVD-GAN [DVD_GAN] requires a very large compute budget to train at high resolution, as reported by [MoCoGAN-HD].)
In this work, we argue that this design choice is not optimal and propose to treat videos in their natural form: as continuous signals that map any time coordinate into an image frame. Consequently, we develop a GAN-based continuous video synthesis framework by extending the recent paradigm of neural representations [NeRF, SIREN, FourierFeatures] to the video generation domain.
Developing such a framework comes with three challenges. First, sine/cosine positional embeddings are periodic by design and depend only on the input coordinates. This does not suit video generation, where temporal information should be aperiodic (otherwise, videos will be cycled) and different for different samples. Next, since videos are perceived as infinite continuous signals, one needs to develop an appropriate sampling scheme to use them in a practical framework. Finally, one needs to accordingly redesign the discriminator to operate in the new sampling pipeline.
To solve the first issue, we develop positional embeddings with time-varying wave parameters which depend on motion information, sampled uniquely for different videos. This motion information is represented as a sequence of motion codes produced by a padding-less conv1d-based model. We prefer it over the usual LSTM network [MoCoGAN, MoCoGAN-HD, TGANv2, VideoGLO] to alleviate the RNN’s instability when unrolled to large depths and to produce frames non-autoregressively.
Next, we investigate the question of how many samples are needed to learn a meaningful video generator. We argue that it can be learned from extremely sparse videos (as few as 2 frames per clip), and justify it with a simple theoretical exposition in Sec 3.3 and practical experiments (see Table 2).
Finally, since our model sees only 2-4 randomly sampled frames per video, it is highly redundant to use expensive conv3d-blocks in the discriminator, which are designed to operate on long sequences of equidistant frames. That’s why we replace it with a conv2d-based model, which aggregates information temporally via simple concatenation and is conditioned on the time distances between its input frames. We use hypernetwork-based modulation [Hypernetworks] for this conditioning to make the discriminator more flexible in processing frames sampled at varying time distances. Such a redesign improves training efficiency (see Table 1), provides a more informative gradient signal to the generator (see Fig 4) and simplifies the overall pipeline (see Sec 3.2), since we no longer need two different discriminators to operate on image and video levels separately, as modern video synthesis models do (e.g., [MoCoGAN, TGANv2, DVD_GAN]).
We build our model, named StyleGAN-V, on top of the image-based StyleGAN2 [StyleGAN2]. It is able to produce arbitrarily long videos at an arbitrarily high frame rate in a non-autoregressive manner and enjoys great training efficiency: it is only marginally costlier than the classical image-based StyleGAN2 model [StyleGAN2], while having only slightly worse plain image quality in terms of FID [FID] (see Fig 3). This allows us to easily scale it to HQ datasets, and we demonstrate that it is directly trainable at 1024^2 resolution.
For empirical evaluation, we use 5 benchmarks: FaceForensics [FaceForensics_dataset], SkyTimelapse [SkyTimelapse_dataset], UCF101 [UCF101_dataset], RainbowJelly (introduced in our work) and MEAD [MEAD_dataset]. Apart from our model, we train 5 different methods from scratch and measure their performance using the same evaluation protocol. Frechet Video Distance (FVD) [FVD] serves as the main metric for video synthesis, but there is no complete official implementation for it (see Sec 4 and Appx C). This leads to discrepancies in the evaluation procedures used by different works because FVD, similarly to FID [FID], is very sensitive to the data format and sampling strategy [FID_evaluation]. That’s why we implement, document and release our complete FVD evaluation protocol. In terms of sheer metrics, our method performs on average better than the closest runner-up.
Early works on video synthesis mainly focused on video prediction [SiftFlow, UnsupervisedVisualPrediction], i.e. generating future frames given a sequence of the previously seen ones.
Early approaches for this problem typically employed recurrent convolutional models trained with a reconstruction objective [video_language_modeling, Robot_pushing_dataset, LSTMs_video_representations], but later adversarial losses were introduced to improve the synthesis quality [MultiScaleVideoPrediction, Video2Video, GeneratingVideosWithSceneDynamics].
Some recent works explore autoregressive video prediction with recurrent or attention-based models (e.g., [VideoTransformer, LVT, VideoGPT, PredictingVideoWithVQVAE, VideoPixelNetworks]).
Another close line of research is video interpolation, i.e. increasing the frame rate of a given video (e.g., [Video_interpolation_ASC, Video_interpolation_DAIN, Video_interpolation_SuperSloMo]). In our work, we study video generation, which is a more challenging problem than video prediction since it seeks to synthesize videos from scratch, i.e. without the expressive conditioning on previous frames. Classical methods in this direction are typically based on GANs [GANs]. MoCoGAN [MoCoGAN] and TGAN [TGAN] decompose the generator’s input noise into a content code and motion codes, which became a standard strategy for many subsequent works (e.g., [MoCoGAN-HD, TGANv2, VideoGLO, TemporalShiftGAN]). SVGAN [SelfSupervisedVideoGANs] additionally adds self-supervision losses to improve the synthesis.
MoCoGAN-HD [MoCoGAN-HD] and StyleVideoGAN [StyleVideoGAN], similar to us, consider high-resolution video synthesis. But in their case, the authors perform indirect training: they train a motion-code generator in the latent space of a pretrained StyleGAN2 model. StyleGAN-V is trained on extremely sparse videos. This makes it related to [TGANv2, Hierarchical_video_generation, Inmodegan], which use a pyramid of discriminators operating at different temporal resolutions. In contrast to the prior work, our generator is continuous in time. In this way it is similar to Vid-ODE [VidODE]: a continuous-time video interpolation and prediction model based on neural ODEs [NODE].
To the best of our knowledge, all modern video synthesis approaches utilize expensive conv3d blocks in their decoder and/or encoder components (e.g., [MoCoGAN, MoCoGAN-HD, TGANv2, TemporalShiftGAN, DVD_GAN, LDVDGAN, ProVGAN]). Often, GAN-based approaches utilize two discriminators, operating on image and video levels independently, where the video discriminator operates at a low resolution to save computation (e.g., [MoCoGAN, G3AN, MoCoGAN-HD, DVD_GAN]). In our work, we show that it’s enough to use a single holistic hypernetwork-based [Hypernetworks] discriminator conditioned on the time difference between frames to build a state-of-the-art temporally coherent video generator. For conditioning, we use ModConv2d blocks from [StyleGAN2], which is similar to AnyCostGAN [AnyCostGAN].
Neural representations. Neural representations are a recent paradigm that uses neural networks to represent continuous signals, such as images, videos, audio, 3D objects and scenes (e.g., [NeRF, SIREN, FourierFeatures, SRNs, TemplateImplicitFunction]). It is mostly popular for 3D reconstruction and geometry processing tasks (e.g., [DeepSDF, DeepMeta, OccupancyNetworks, ConvolutionalOccupancyNetworks, DVR]), including video-based reconstruction [Nerfies, SpaceTimeNeIF, D-nerf, NSFF]. Several recent projects explored the task of building generative models over such representations to synthesize images (e.g., [INR-GAN, CIPS, ALIS]), 3D objects (e.g., [GRAF, piGAN, NeRF-VAE]) or multi-modal signals (e.g., [INRs_distribution, GEM]), and our work extends this line of research to video generation.
Concurrent works. The development of neural representation-based approaches moves extremely fast, and several concurrent works propose ideas similar to ours. DIGAN [DIGAN] explores the same direction of using neural representations for continuous video synthesis and shares a lot of ideas with our work. The authors also consider a continuous-time generator, trained with a discriminator without conv3d layers. The core difference from our work is that they use a different parametrization of motions and a dual discriminator: one operates on pairs of frames and the second one on individual images. We enumerate the differences and similarities in Appx H. NeRV [NeRV] is another concurrent project, which proposes to represent videos as convolutional neural representations; but in their case, the authors explore compression and denoising tasks. GEM [GEM] utilizes generative latent optimization [GLO] to build a multi-modal generative model.
Our model is based on the paradigm of neural representations [NeRF, SIREN, FourierFeatures], i.e. representing signals as neural networks. We treat each video as a function that is continuous in time. In this manner, the training dataset is a set of subsampled signals: each video is given by its frames together with their time positions. (To simplify the notation, we assume that all videos have the same frame rate and that all videos were sampled starting at t = 0, but this is not a limitation of the method.) Note that each video might have a different length, and in practice these lengths vary a lot (see Appx E for dataset statistics). Our goal is to train a generative model over video signals, having only their subsampled versions. To achieve this, we develop the following framework.
We build the model on top of StyleGAN2 [StyleGAN2-ADA] and redesign its generator and discriminator networks for video synthesis. Our generator is conceptually similar to MoCoGAN [MoCoGAN], i.e., we separate latent information into a content code and a motion trajectory. In contrast to MoCoGAN, our motion codes are continuous in time, and we describe their design in Sec 3.1. The only modification we make on top of StyleGAN2’s generator is the concatenation of our continuous motion codes to its constant input tensor. In all other aspects, it is entirely equivalent to its image-based counterpart.
The discriminator takes the frames of a sparsely sampled video, independently extracts features from them, concatenates those features channel-wise into a global video descriptor and predicts the real/fake class from it. The discriminator is conditioned on the time distances between the frames, and we use hypernetwork-based conditioning to input this information.
Overview. The generator consists of three components: a content mapping network, a motion mapping network and a synthesis network. The content mapping and synthesis networks are equivalent to their StyleGAN2 counterparts, with the exception that we tile and concatenate motion codes to the constant input tensor of the synthesis network.
A video is generated the following way. First, we sample the content noise and, following StyleGAN2, transform it into a latent code, which is shared for all timesteps of a video. Then, to generate a frame at the specified time location t, we compute its motion code in three steps. First, we sample a discrete sequence of equidistant trajectory noise vectors, positioned at a fixed distance from one another. The number of tokens is determined by the desired timestep t: the sequence should be long enough to cover it. (In practice, since the motion mapping network uses padding-less convolutions, this sequence is slightly larger; we elaborate on this in Appx B.) Then, we process the sequence with the conv1d-based motion mapping network with a large kernel size into a sequence of motion tokens. After that, we take the pair of tokens whose time locations bracket t and compute an acyclic positional embedding from them, described next. This positional embedding serves as the motion code for our generator. In fact, we do not need to sample all the motion noise vectors to produce a frame, but only the ones it depends on. In this way, our generator can produce frames non-autoregressively.
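The locality of the scheme above can be illustrated with a toy numpy sketch. Uniform conv weights, the deterministic noise sampler and the simple blending rule are illustrative stand-ins, not the actual learned model:

```python
import numpy as np

def noise_for(i):
    """Deterministic stand-in for sampling the i-th motion-noise vector."""
    return np.random.default_rng(i).standard_normal(4)

def motion_code(t, delta, kernel=3):
    """Toy sketch of the non-autoregressive motion-code lookup.

    Grid cells have width `delta`; a padding-less conv1d with kernel
    size `kernel` (uniform weights here instead of learned ones) turns
    noise vectors into tokens, and the two tokens bracketing t are
    blended. Only the noise vectors that t depends on are ever sampled.
    """
    i = int(t // delta)                   # grid cell that contains t
    need = [noise_for(j) for j in range(i, i + kernel + 1)]
    u_lo = sum(need[:kernel]) / kernel    # token at grid point i
    u_hi = sum(need[1:]) / kernel         # token at grid point i + 1
    frac = (t - i * delta) / delta        # position of t inside the cell
    return (1 - frac) * u_lo + frac * u_hi

# A late timestep needs only its local noise window, and neighboring
# cells share tokens, so the code is continuous across cell boundaries.
assert np.allclose(motion_code(6.0, 3.0), motion_code(6.0 - 1e-9, 3.0), atol=1e-6)
```

The boundary check passes because adjacent cells reuse the same bracketing token, which is exactly what permits generating any single frame without unrolling the whole trajectory.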
Acyclic positional encoding. Traditional positional embeddings [SIREN, FourierFeatures] are cyclic by default. This does not create problems in traditional applications (like image or scene representations) because the spatial domain used there never exceeds the period length [NeRF, INR-GAN]. But for video generation, cyclicity is not desirable because it makes a video loop at some point. To solve this issue, we develop the acyclic positional encoding mechanism.
A sine-based positional embedding vector can be expressed in the following form:

p(t) = A ⊙ sin(2πt / T + φ),

where ⊙ denotes element-wise vector multiplication, A, T, φ are the amplitudes, periods and phases of the corresponding waves, and the sine function is applied element-wise. By default, these embeddings are periodic and always the same for any input [SIREN, FourierFeatures, NeRF], which is not desirable for video synthesis, where natural videos contain different motions and are typically aperiodic. To solve this issue, we compute the wave parameters from the motion noise in the following way. First, “raw” motion codes are computed using wave parameters predicted from the motion tokens:
where the wave parameters are predicted via learnable weight matrices. Using these raw codes directly as motion codes does not lead to good results, since they contain discontinuities (see Fig 8(d)). That’s why we “stitch” their start and end values: the raw embedding is shifted by the element-wise linear interpolation (lerp) of its own values at the two surrounding token locations, computed using the time position. The first subtraction in Eq (4) alters the positional embeddings to make them converge to zero values at the token locations. This limits the expressive power of the positional embeddings, and that’s why we add “alignment” vectors, produced by a learnable weight matrix, to restore it. See Fig 8(e) for the visualization.
In practice, we found it useful to compute the periods using a vector of ones and linearly-spaced scaling coefficients. See Appx B and the source code for details.
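The stitching idea can be summarized with a small numpy sketch. The raw sine embedding, the anchor time locations and the alignment vectors are illustrative placeholders; in the real model all wave parameters and alignment vectors are predicted from motion noise:

```python
import numpy as np

def raw_embedding(t, amps, periods, phases):
    """Sine-based embedding: periodic, hence cyclic on its own."""
    return amps * np.sin(2 * np.pi * t / periods + phases)

def acyclic_embedding(t, t_lo, t_hi, raw, align_lo, align_hi):
    """Stitching sketch: subtract the lerp of the raw embedding's own
    boundary values (so it vanishes at the two anchor locations), then
    add the lerp of per-anchor alignment vectors to restore expressivity."""
    frac = (t - t_lo) / (t_hi - t_lo)
    lerp = lambda a, b: (1 - frac) * a + frac * b
    return raw(t) - lerp(raw(t_lo), raw(t_hi)) + lerp(align_lo, align_hi)

A, T, phi = np.array([1.0, 0.5]), np.array([3.0, 7.0]), np.array([0.0, 0.4])
raw = lambda t: raw_embedding(t, A, T, phi)
a_lo, a_hi = np.array([0.1, -0.2]), np.array([0.3, 0.4])

# At the anchors the result equals the alignment vectors exactly, so
# consecutive intervals sharing an anchor join without discontinuities.
assert np.allclose(acyclic_embedding(0.0, 0.0, 1.0, raw, a_lo, a_hi), a_lo)
assert np.allclose(acyclic_embedding(1.0, 0.0, 1.0, raw, a_lo, a_hi), a_hi)
```

Because the endpoint values no longer depend on the periodic raw waves, the resulting trajectory is aperiodic while remaining smooth between anchors.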
One could try using the continuous motion tokens directly as motion codes instead of the positional embeddings computed from them. This also eliminates cyclicity (in theory), but leads to poor results in practice: if the distance between motion noise vectors is small, then the motion trajectory will contain unnatural sharp transitions; and when it is increased, the model loses its ability to properly model high-frequency motions (like blinking), since the codes change too slowly. We empirically validate this in Tab 2 (also see samples on the project webpage).
Modern video generators typically utilize two separate discriminators which operate on image and video levels separately [DVD_GAN, MoCoGAN, MoCoGAN-HD]. But since we train on extremely sparse videos and aim to have a computationally efficient model, we propose to use a holistic hypernetwork-based discriminator, which is conditioned on the time distances between frames. It consists of two parts: 1) a feature extractor backbone, which independently embeds each image frame into a 3D feature tensor; and 2) a convolutional head, which takes the concatenation of all the features and outputs the real/fake logit.
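This two-part structure can be sketched in a few lines of numpy. The `embed` and `head` callables and all shapes are toy stand-ins for the actual backbone and convolutional head:

```python
import numpy as np

def video_logit(frames, embed, head):
    """Discriminator sketch: embed each frame independently, then
    concatenate the per-frame feature maps channel-wise into a single
    video descriptor, and predict one real/fake logit from it."""
    feats = [embed(f) for f in frames]          # each: (C, H, W)
    descriptor = np.concatenate(feats, axis=0)  # (k * C, H, W)
    return head(descriptor)

embed = lambda f: f.mean(axis=0, keepdims=True)  # toy (1, H, W) feature map
head = lambda d: float(d.sum())                  # toy scalar logit
frames = [np.ones((3, 4, 4)), np.zeros((3, 4, 4))]
logit = video_logit(frames, embed, head)
assert logit == 16.0
```

Because frames are embedded independently and only fused by concatenation, the cost grows linearly with the (small) number of sampled frames, with no conv3d blocks involved.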
We input the information about time distances between frames into the discriminator the following way. First, we encode them with a positional encoding [SIREN, FourierFeatures], preprocess them with a 2-layer MLP and concatenate the results into a single conditioning vector. After that, we use this vector to modulate the weights of the first convolutional layer in each block of the backbone and the head, and also as the conditioning vector for the projection head [ProjectionDiscriminator] in StyleGAN2’s DiscrEpilogue block. The modulation is equivalent to the style modulation in the generator, but uses the conditioning vector instead of the style vector: it is passed through a mapping network, transformed with an affine layer and multiplied with the 4D weight tensor of a convolutional layer along the input-channel axis. We do not modulate each convolutional layer for practical reasons: ModConv2d is heavier than Conv2d. The overall architecture is visualized in Fig 6.
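The modulation step can be sketched as follows (a simplified ModConv2d-style weight modulation; the affine parameters, names and shapes are illustrative, not the actual implementation):

```python
import numpy as np

def modulate_weights(weight, cond, affine_W, affine_b):
    """Hypernetwork-style modulation sketch.

    weight: (out_ch, in_ch, kh, kw) convolution kernel
    cond:   conditioning vector (here: embedded frame time-distances)
    The affine maps `cond` to one scale per *input channel*; the scales
    then multiply the kernel along that axis, as in StyleGAN2's
    modulated convolutions.
    """
    scales = affine_W @ cond + affine_b          # (in_ch,)
    return weight * scales[None, :, None, None]  # broadcast over in_ch

out_ch, in_ch, k = 4, 3, 3
w = np.ones((out_ch, in_ch, k, k))
cond = np.array([1.0, -1.0])
W = np.zeros((in_ch, 2))
b = np.array([1.0, 2.0, 3.0])
wm = modulate_weights(w, cond, W, b)
assert np.allclose(wm[:, 2], 3.0)  # input channel 2 scaled by 3
```

With zero affine weights the scales reduce to the bias vector, which is exactly the degenerate case used in our ablation where zeros are fed instead of time-distance embeddings.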
Such an approach is considerably more efficient than using dual discriminators, since we no longer need a conv3d-based discriminator, which is too expensive to operate on high-resolution videos [DVD_GAN]. Moreover, as we demonstrate in Fig 4, it provides a more informative learning signal to the generator.
Videos are continuous signals and any practical video generator relies on some sort of subsampling. The question of how many samples are necessary to train a video synthesis model is fundamental, because this design decision greatly influences the quality and training cost. In our work, we empirically show that one can train a state-of-the-art video generator with as few as 2 frames per video.
Consider the problem of learning a probability distribution over n-dimensional vectors, and suppose that we utilize sparse training, i.e. select a random subset of k coordinates of the vector on each iteration of the optimization process. Then our optimization objective is equivalent to learning all possible k-dimensional marginal distributions instead of learning the joint distribution. When does learning the marginals allow one to recover the full joint distribution? The following simple statement adds some clarity to this question.
Consider collections of index sets of up to k indices each. Then the joint distribution can be represented as a product of marginals over such sets if and only if, for every coordinate, there exists a set of at most k−1 other coordinates that fully determines its conditional distribution (see the precise formulation in Appx F).
The above statement is primitive (see the proof in Appx F), but it can provide useful practical intuition. For video synthesis, it implies that one can learn a video generator by using only k frames per video only if, for any frame, there exist at most k−1 previous frames sufficient to properly predict it (see Appx F). And we argue that for modern video synthesis datasets, one does not need a lot of frames to make such a prediction. For example, for SkyTimelapse [SkyTimelapse_dataset], the motions are typically unidirectional and thus easily predictable from only 2 previous frames (which corresponds to training with 3 frames per video). But surprisingly, in practice we found that using even 2 frames per clip can provide state-of-the-art performance.
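A tiny numerical illustration of this intuition, on a hypothetical 3-frame Markov "video" over binary frames (all probabilities invented for the example): pairwise, i.e. 2-frame, statistics already pin down the full joint.

```python
import numpy as np

p1 = np.array([0.3, 0.7])               # p(x1): distribution of frame 1
T = np.array([[0.9, 0.1],               # p(x_{t+1} | x_t): frame transition
              [0.2, 0.8]])

# Full joint over (x1, x2, x3) under the Markov assumption.
joint = p1[:, None, None] * T[:, :, None] * T[None, :, :]

# Rebuild the joint from *pairwise* quantities only:
p12 = joint.sum(axis=2)                 # 2-frame marginal p(x1, x2)
rebuilt = p12[:, :, None] * T[None, :, :]   # p(x1, x2) * p(x3 | x2)
assert np.allclose(joint, rebuilt)
assert abs(joint.sum() - 1.0) < 1e-12
```

Here each frame is predictable from a single previous one, so training on 2-frame clips loses nothing; datasets with longer-range dependencies would need larger k.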
We treat videos as infinite continuous signals, but in practice one has to set a limit on the maximum time location which can be seen during training. To the best of our knowledge, previous methods use at most a limited temporal window [TGANv2, Hierarchical_video_generation], but in our case we train the model with much larger maximum time locations, which does not lead to much additional computational burden due to the non-autoregressive nature of our generator. However, we set the maximum distance between the first and last sampled frames to 32 to cover short-term and medium-term movements: otherwise, we observed unstable training and abrupt motions in video samples. To sample frames, we first sample this distance and then the location of the first frame. After that, the locations of the remaining frames are selected at random without repetitions.
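The sampling procedure above can be sketched as follows. The `max_t` horizon and the exact (uniform) distributions are illustrative assumptions; only the ordering of the steps and the span cap of 32 come from the text:

```python
import random

def sample_frame_locations(k, max_t=1024, max_span=32, rng=random):
    """Sketch: draw the span between first and last frames (capped),
    then the first frame's location, then the remaining k - 2 frames
    uniformly in between, without repetitions."""
    span = rng.randint(k - 1, max_span)     # need k distinct integer slots
    t_first = rng.randint(0, max_t - span)
    t_last = t_first + span
    inner = rng.sample(range(t_first + 1, t_last), k - 2)
    return sorted([t_first, *inner, t_last])

random.seed(0)
for _ in range(100):
    ts = sample_frame_locations(3)
    assert len(set(ts)) == 3 and ts == sorted(ts)
    assert 2 <= ts[-1] - ts[0] <= 32
```

Sampling the span first (rather than frames independently) guarantees every minibatch contains both a short-range and a bounded medium-range time difference for the discriminator to condition on.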
|Method|FaceForensics FVD_16|FVD_128|SkyTimelapse FVD_16|FVD_128|UCF101 FVD_16|FVD_128|RainbowJelly FVD_16|FVD_128|Cost (GPU-days)|
|---|---|---|---|---|---|---|---|---|---|
|MoCoGAN + StyleGAN2 backbone|55.62|309.3|85.88|272.8|1821.4|2311.3|638.5|463.0|8|
|MoCoGAN-HD [MoCoGAN-HD]|111.8|653.0|164.1|878.1|1729.6|2606.5|579.1|628.2|7.5 + 9|
|VideoGPT [VideoGPT]|185.9|N/A|222.7|N/A|2880.6|N/A|136.0|N/A|16 + 16|
|DIGAN [DIGAN] (concurrent work)|62.5|1824.7|83.11|196.7|1630.2|2293.7|436.6|369.0|16|
Datasets. We test our model on 5 benchmarks: FaceForensics [FaceForensics_dataset], SkyTimelapse [SkyTimelapse_dataset], UCF101 [UCF101_dataset], RainbowJelly (introduced by us) and MEAD [MEAD_dataset]. We used the train splits (when available) for all the datasets except for UCF101, which is an extremely difficult dataset, so we used its train+test splits. We noticed that modern video synthesis datasets are either too simple or too difficult in terms of content and motion, and there are no datasets “in-between”. That’s why we introduce RainbowJelly: a dataset of floating jellyfish from the Hoccori Japan YouTube channel, which contains 8 hours of high-resolution video in total. It contains simple content but complex hierarchical motions, and this makes it a challenging but approachable test-bed for evaluating modern video generators. We provide its details in Appx E. All the datasets share the same FPS, except for RainbowJelly and MEAD, which have 30 FPS.
Evaluation. Following prior work, we use Frechet Video Distance (FVD) [FVD] and Inception Score (IS) as our evaluation metrics, with FVD being the main one, since Frechet-distance-based metrics like FID (its image-based counterpart) align better with human-perceived quality [FID]. We use two versions of FVD: FVD_16 and FVD_128, which use 16-frame-long and 128-frame-long videos respectively to compute their statistics. Inception Score is used only to evaluate the generation quality on UCF-101, since it uses a UCF-101-finetuned C3D model [TGAN].
The official FVD project [FVD] does not provide a complete implementation of the evaluation pipeline, but rather an inference script for a single batch of videos, which are required to be already resized and loaded into memory. This creates discrepancies in the evaluation protocols used by previous works, since FVD (similar to FID [FID_evaluation]) is very sensitive to interpolation procedures and perceptually unnoticeable artifacts introduced by data processing, like JPEG compression. We also found it to be very sensitive to how one extracts clips from real videos to compute the statistics. We implement and release a complete FVD evaluation protocol and use it to evaluate all the methods for a fair comparison. It is documented in Appx C.
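For reference, the Fréchet distance between Gaussian feature statistics that underlies FVD and FID can be computed as below. This is a generic numpy sketch on toy statistics, not our I3D-based evaluation pipeline:

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance between two Gaussians:
        ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * (C1^{1/2} C2 C1^{1/2})^{1/2})
    """
    def psd_sqrt(m):
        # Symmetric PSD square root via eigendecomposition.
        vals, vecs = np.linalg.eigh(m)
        return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

    s1 = psd_sqrt(cov1)
    cross = psd_sqrt(s1 @ cov2 @ s1)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2 * cross))

# Identical statistics give distance 0; shifting the mean of a
# unit-covariance Gaussian by m adds exactly ||m||^2.
mu, cov = np.zeros(2), np.eye(2)
assert abs(frechet_distance(mu, cov, mu, cov)) < 1e-8
assert abs(frechet_distance(mu, cov, np.array([3.0, 4.0]), cov) - 25.0) < 1e-8
```

The distance itself is simple; the sensitivity we describe comes from everything upstream of it, i.e. how frames are decoded, resized and sampled before the feature statistics are computed.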
Baselines. We use 5 baselines for comparison: MoCoGAN [MoCoGAN], MoCoGAN [MoCoGAN] with the StyleGAN2 [StyleGAN2] backbone, VideoGPT [VideoGPT], MoCoGAN-HD [MoCoGAN-HD] and DIGAN [DIGAN]. For MoCoGAN with the StyleGAN2 backbone (denoted MoCoGAN+SG2), we replaced its generator and image-based discriminator with the corresponding StyleGAN2 components, leaving its video discriminator unchanged. We also used the training scheme and regularizations from StyleGAN2. MoCoGAN was trained for 5 days on a single GPU, since its lightweight DC-GAN [DC_GAN] backbone makes it fast to train, while MoCoGAN+SG2 was trained for 2 days on multiple GPUs to reach 25M real images seen by its image-based discriminator. MoCoGAN-HD was trained for 4.5 days on V100 GPUs, as specified in the original paper (Appx B of [MoCoGAN-HD]). We trained VideoGPT for the maximum affordable total time of 32 GPU-days within our resource constraints. We trained DIGAN [DIGAN] for 5 days, since we observed that by that time the metrics either plateaued or exploded (for RainbowJelly). We also noticed that DIGAN uses weighted sampling during training, selecting clips from long videos with higher probabilities. This hurts its FVD score, and that’s why we altered its data sampling strategy to the uniform one used by the other methods [MoCoGAN, MoCoGAN-HD, VideoGPT]. For each method we selected the checkpoint with the best FVD value.
For the main evaluation, we train our method and all the baselines from scratch on the described datasets. Each model is trained on NVidia V100 32 GB GPUs, except for VideoGPT, which is very demanding in terms of GPU memory at this resolution, so we had to train it on NVidia A6000 GPUs instead (with an overall batch size of 4). We use R1 regularization [R1_reg] and reduce the learning rate by a factor of 10 for the video discriminator of MoCoGAN+SG2, since it does not have an equalized learning rate [ProGAN]. We use the same hyperparameters for all the experiments except for SkyTimelapse, where we used a slightly different setting. See other training details in Appx B. We evaluate all the methods under the same evaluation protocol, described in Appx C, and report the results in Table 1.
To measure the efficiency, we use the amount of GPU days required to train a method.
We build on top of the official StyleGAN2 implementation.555https://github.com/NVlabs/stylegan2-ada-pytorch
The training cost of the image-based StyleGAN2 to reach its specified 25M images is ≈8 NVidia V100 GPU-days in our environment.
StyleGAN-V is trained for 2 days, which corresponds to 23M real frames seen by the discriminator.
MoCoGAN-HD is built on top of the stylegan2-pytorch codebase666https://github.com/rosinality/stylegan2-pytorch, which is roughly two times slower than the highly optimized NVidia implementation. That’s why in Table 1 we report its training cost reduced by a factor of 2 to account for this.
|Ablation|FaceForensics FVD_16|FVD_128|SkyTimelapse FVD_16|FVD_128|
|---|---|---|---|---|
|w/o hyper-modulation in D|88.8|161.5|69.8|184.1|
|w/o projection conditioning in D|95.4|236.0|102.1|210.3|
|w/ LSTM codes (δ setting 1)|131.9|159.1|135.7|196.1|
|w/ LSTM codes (δ setting 2)|180.3|94.55|95.71|165.8|
[Table 3: FVD for different numbers of sampled frames per video, on FaceForensics and SkyTimelapse]
Our method significantly outperforms the existing ones on almost all the benchmarks in terms of FVD_16 and FVD_128. On UCF101, all the methods perform poorly, which shows that it is too difficult a benchmark for modern video generators, and one would need to train models of extreme scale to fit it [DVD_GAN]. For completeness, the Inception Score [TGAN] results are provided in Table 5 in Appx C.
We visualize the samples in Fig 1 and Fig 7. Our method is able to generate hour-long plausible-looking videos, though their motion diversity and global motion coherence are limited (see Appx A). MoCoGAN-HD suffers from LSTM instability when unrolled to large lengths and does not produce diverse motions. DIGAN produces high-quality videos on SkyTimelapse because its inductive bias of having joint spatio-temporal positional information is well suited for videos where the entire scene moves. But for FaceForensics, this leads to a “head flying away” effect (see Appx H). To generate 1-hour-long videos from MoCoGAN-HD, we unroll its LSTM model to the required depth and synthesize frames only at the necessary time positions, while DIGAN, similar to our method, is able to generate frames non-autoregressively.
To ablate the core components, we replaced the motion mapping network or the discriminator with their MoCoGAN+SG2 counterparts. In both cases, the replacement leads to poor short-term and long-term video quality, as reflected by the corresponding metrics in Table 2 and the video samples in the supplementary.
Replacing our continuous motion codes with codes produced by an LSTM model hurts the performance, especially when the distance between motion codes is small. This happens due to unnaturally abrupt transitions between frames; we provide the corresponding samples in the supplementary and the quantitative results in Table 2.
Another architectural decision which we verify is the conditioning scheme utilized in the discriminator. We use both ModConv2d blocks and the projection conditioning head [ProjectionDiscriminator] in DiscrEpilogue to provide the conditioning signal about the time differences between frames. Removing either of them hurts the performance, because it constrains the ability of the discriminator to understand the temporal scale it is currently operating on. We ablated the hypernetwork-based modulation by feeding a vector of zeros instead of the positional embeddings of time differences into ModConv2d, so that only the bias vectors in the corresponding affine transforms participate in the modulation process.
An important design choice is how many frames per video one should use during training. We try different numbers of frames and report the corresponding results in Table 3. As discussed in Sec 3.3, for existing video generation benchmarks it might be enough to sample only several frames per video, and our experiments confirm this observation. The performance decreases for larger numbers of frames, but this might be attributed to the weak temporal aggregation procedure of the discriminator, which simply concatenates features together. It is surprising to see that modern datasets can be fit with as few as 2 samples per video.
Our generator is able to produce arbitrarily long videos. Our design of motion codes allows StyleGAN-V not to suffer from stability problems when unrolled to large (potentially infinite) video lengths. This is verified by visualizing video clips at extremely large timesteps in Fig 1 and Fig 8. We also demonstrate its ability to produce videos at an arbitrarily high frame rate in the supplementary.
Our model has the same latent space manipulation properties as StyleGAN2. To show this, we conduct two experiments: (i) embedding, editing and animating an off-the-shelf image, and (ii) editing and animating the first frame of a generated video. To embed an image, we used an optimization procedure similar to [Image2StyleGAN], considering the image to be positioned at t = 0. To edit an image with CLIP, we used the procedure of [StyleCLIP]. The results of these experiments are visualized in Fig 2; we provide the details in Appx B and more examples in the supplementary. Apart from showing the good properties of its latent space, these experiments demonstrate the extrapolation potential of our generator.
StyleGAN-V has almost the same training efficiency and image quality as StyleGAN2. In Fig 3, we plot the FID scores and training costs of modern video generators on FaceForensics against their corresponding FVD scores. Our method comes very close to StyleGAN2: it converges to an FID of 9.44 in 8 GPU-days, compared to an FID of 8.42 in 7.72 GPU-days for StyleGAN2, i.e. only ≈12% worse image quality. This raises the question of whether video generators can be as computationally efficient and as good in terms of image quality as image-based ones.
Our model is the first one directly trainable at 1024^2 resolution. We provide generations on MEAD for our method and for MoCoGAN-HD. MoCoGAN-HD cannot preserve the identity of a speaker and diverges for large video lengths, while our method achieves comparable image quality and coherent motions. For this dataset, our model was trained for 7 days on NVidia V100 GPUs and obtained an FID of 24.12 and an FVD of 156.1. The image generator for MoCoGAN-HD was trained for several days on A6000 GPUs, while its video generator required considerably less training since it did not require high-resolution training.
Our discriminator provides more informative learning signal to . Fig 4 visualizes the gradient signal to the generator from our discriminator and from the conv3d-based video discriminator of MoCoGAN-HD, measured at of training for our method (at 10M images seen by
) and MoCoGAN-HD (at the 300-th epoch). In our case, one can easily see fine-grained details of the face structure perceived by , while in the case of MoCoGAN-HD, most of the gradient is redundant and lacks any structural information.
Content and motion decomposition. Similar to MoCoGAN [MoCoGAN], our generator captures content and motion variations in a disentangled manner: altering motion codes while fixing does not change the appearance (e.g., a speaker's identity). Similarly, re-sampling does not influence the motion patterns in a video, but only its content. We provide the corresponding visualizations on the project website.
In this work, we provided a different perspective on time for video synthesis and built a continuous video generator using the paradigm of neural representations. For this, we developed motion representations through the lens of positional embeddings, explored sparse training of video generators and redesigned the typical dual structure of a video discriminator. Our model is built on top of StyleGAN2 and inherits many of its perks, such as efficient training, good image quality and an editable latent space.
We hope that our work will serve as a solid basis for building more powerful video generators in the future. The limitations and potential negative impact are discussed in Appx A.
Our model has the following limitations:
Limitations of sparse training. In general, sparse training makes it impossible for to capture complex dependencies between frames. But surprisingly, it provides state-of-the-art results on modern datasets, which (using the statement from Sec 3.3) implies that they are not that sophisticated in terms of motion.
. Similar to other machine learning models, our method is bound by the quality of the dataset it is trained on. For example, for the FaceForensics dataset [FaceForensics_dataset], our embedding and manipulation results are inferior to the StyleGAN2 ones [Image2StyleGAN]. This is due to the limited number of identities (just 700) in FaceForensics and their greater variability in quality compared to FFHQ [StyleGAN], which StyleGAN2 was trained on.
Periodicity artifacts. still produces periodic motions sometimes, despite our acyclic positional embeddings. Further investigation of this phenomenon is needed.
Poor handling of newly appearing content. We noticed that our generator tries to reuse the content information encoded in the global latent code as much as possible. This is noticeable on datasets where new content appears during a video, like Sky Timelapse or Rainbow Jelly. We believe it can be resolved using ideas similar to ALIS [ALIS].
Sensitivity to hyperparameters. We found our generator to be sensitive to the minimal initial period length (See Appx B). We increased it for SkyTimelapse [SkyTimelapse_dataset] from 16 to 256: otherwise it contained unnatural sharp transitions.
We plan to address these limitations in future work.
The potential negative impact of our method is similar to that of traditional image-based GANs: creating “deepfakes” and using them for malicious purposes (https://en.wikipedia.org/wiki/Deepfake). Our model makes it much easier to train a generator which produces much more realistic video samples with a small amount of computational resources. But since the availability of high-quality datasets is very low for video synthesis, the resulting model will fall short of its image-based counterparts, which can use rich, high-quality image datasets for training, like FFHQ [StyleGAN].
Note that all the details can be found in the source code: https://github.com/universome/stylegan-v.
Our model is built on top of the official StyleGAN2-ADA [StyleGAN2-ADA] repository (https://github.com/nvlabs/stylegan2-ada). Since we build a model to generate continuous videos, a reasonable question to ask is: why not use INR-GAN [INR-GAN] instead (like DIGAN [DIGAN]) to have fully continuous signals? The reason we chose StyleGAN2 over INR-GAN is that StyleGAN2 is amenable to mixed-precision training, which makes it train times faster. For INR-GAN, enabling mixed precision severely decreases the quality. We hypothesize that the reason is that each pixel in INR-GAN's activations tensor carries more information (due to the spatial independence), since the model cannot distribute information spatially, and explicitly restricting the range of possible values adds a strict upper bound on the amount of information each pixel is able to carry. We also found that adding coordinate information does not improve video quality for our generator, either qualitatively or in terms of scores.
Similar to StyleGAN2, we utilize the non-saturating loss and regularization with a loss coefficient of 0.2 in all the experiments; this value is inherited from the original repo and we didn't run any hyperparameter search for it. We also use an fmaps parameter of 0.5 (the original StyleGAN2 used an fmaps parameter of 1.0), which controls the channel dimensionalities in and , since it is the default setting for StyleGAN2-ADA at resolution. This allowed us to further speed up training.
The dimensionalities of are all set to 512.
As stated in the main text, we use a padding-less conv1d-based motion mapping network with a large kernel size to generate raw motion codes . In all the experiments, we use a kernel size of
and a stride of . We do not use any dilation, despite the fact that it could increase the temporal receptive field: we found that varying the kernel size didn't produce much benefit in terms of video quality. Using padding-less convolutions allows the model to be stable when unrolled to large depths. We use 2 layers of such convolutions with a hidden size of 512. Another benefit of conv1d-based blocks is that, in contrast to LSTM/GRU cells, one can practically incorporate the equalized learning rate scheme [ProGAN] into them.
Using a conv1d-based motion mapping network without paddings forces us to use “previous” motion noise codes . That's why instead of sampling a sequence , we sample a slightly larger one to adjust for the reduced sequence size. With a same-padding strategy, to sample a frame at position , we would need to produce motion noise codes . But with our kernel size of 11, 2 layers of convolutions and no padding, the resulting sequence size is .
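As a sanity check, the sequence-length arithmetic above can be sketched as follows (a minimal illustration; the function names are ours, not from the released code):

```python
# Sketch of the sequence-length arithmetic for a padding-less ("valid")
# conv1d stack with stride 1 and no dilation: each layer trims
# (kernel_size - 1) timesteps, so extra "previous" noise codes are needed.

def output_length(input_length: int, kernel_size: int = 11, num_layers: int = 2) -> int:
    """Temporal length remaining after the conv stack."""
    length = input_length
    for _ in range(num_layers):
        length -= kernel_size - 1  # a valid convolution trims k - 1 steps
    return length

def required_input_length(num_codes: int, kernel_size: int = 11, num_layers: int = 2) -> int:
    """How many motion noise codes must be sampled to obtain `num_codes` outputs."""
    return num_codes + num_layers * (kernel_size - 1)
```

For example, with a kernel size of 11 and 2 layers, producing 16 motion codes requires sampling 16 + 2·10 = 36 noise codes.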
The training performance of VideoGPT on UCF101 is surprisingly low, despite the fact that it was developed for this kind of dataset [VideoGPT]. We hypothesize that this happens due to UCF101 being a very difficult dataset and VideoGPT being trained with a batch size of 4 (a higher batch size didn't fit our 200 GB GPU memory setup), which damaged its ability to learn the distribution.
To train our model, we also utilized the adaptive differentiable augmentations of StyleGAN2-ADA [StyleGAN2-ADA], but we found it important to make them video-consistent, i.e. to apply the same augmentation to each frame of a video. Otherwise, the discriminator starts to underperform and the overall quality decreases. We use the default bgc augmentation pipeline from StyleGAN2-ADA, which includes horizontal flips, 90-degree rotations, scaling, horizontal/vertical translations, hue/saturation/brightness/contrast changes and luma axis flipping.
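The idea of video-consistent augmentation can be illustrated with a minimal sketch (our own toy example, not the actual differentiable ADA pipeline): the augmentation parameters are sampled once per video and then applied identically to every frame.

```python
import random

def augment_video_consistently(video, seed=None):
    """Apply the SAME randomly-sampled augmentation to every frame.
    `video` is a list of frames; each frame is a list of pixel rows.
    Only a horizontal flip is sketched here; the real bgc pipeline also
    includes rotations, scaling, translations and color transforms."""
    rng = random.Random(seed)
    do_flip = rng.random() < 0.5  # sampled once per video...
    if not do_flip:
        return video
    # ...and applied identically to each frame
    return [[row[::-1] for row in frame] for frame in video]
```

Sampling the flip per-frame instead would present the discriminator with temporally inconsistent clips, which is exactly what hurts its performance.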
While training the model, for real videos we first select a video index and then select a random clip (i.e., a clip with a random offset). This differs from the traditional DIGAN or VideoGPT training scheme; that's why we needed to change their data loaders to make the models learn the same statistics and not get biased by very long videos.
To develop this project, NVidia v100 32GB GPU-years + NVidia A6000 GPU-years were spent.
In this subsection, we describe the embedding and editing procedures, which were used to obtain results in Fig 2.
Projection. To project an existing photograph into the latent space of , we used the procedure from StyleGAN2 [StyleGAN2], but projecting into the space [Image2StyleGAN] instead of , since it produces better reconstruction results and does not spoil the editing properties. We set the initial learning rate to and optimized the code with the LPIPS reconstruction loss [LPIPS] for 1000 steps using Adam. For motion codes, we initialized a static sequence and kept it fixed during the optimization process. We noticed that when it is also optimized, the reconstruction becomes almost perfect, but it breaks when another sequence of motion codes is plugged in.
Editing. Our CLIP editing procedure is very similar to the one in StyleCLIP [StyleCLIP], with the exception that we embed an image assuming that it is a video frame at location . On each iteration, we resample the motion codes, since all our edits are semantic and do not refer to motion. We leave motion editing with CLIP for future exploration. For the sky editing video presented in Fig 2, we additionally utilize masking: we initialize a mask to cover the trees and try not to change them during the optimization using the LPIPS loss. For all the videos presented on the supplementary website, no masking is used.
The details can be found in the provided source code.
Mitigating high-frequency artifacts. We noticed that if our periods are left unbounded, they might grow to very large values (up to a magnitude of ), which corresponds to extra-high frequencies (the period length becomes less than 4 frames) and leads to temporal aliasing. That's why we process them via the transform: this bounds them into the range with a mean of 1.0, i.e. using the at-initialization frequency scaling, which we discuss next.
Linearly spaced periods. An important design decision is the scaling of periods since at initialization it should cover both high-frequency and low-frequency details. Existing works use either exponential scaling (e.g., [NeRF, Nerfies, ALIS, SAPE]) or random scaling (e.g., [SIREN, FourierFeatures, INR-GAN, CIPS]). In practice, we scale the -th column of the amplitudes weight matrix with the value:
where we use frames and frames in all the experiments, except for SkyTimelapse, for which we use . We call this scheme linear scaling and use it as an additional tool to alleviate periodicity, since it greatly increases the overall cycle of a positional embedding (see Fig 9). See also the accompanying source code for details.
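Since the exact constants are omitted above, the following is only a hedged sketch of linearly spaced periods (the function name and arguments are ours): instead of exponential spacing, the per-feature periods are placed on a uniform grid between a minimal and a maximal period.

```python
def linearly_spaced_periods(num_feats, min_period, max_period):
    """Return periods spaced linearly (not exponentially) between
    min_period and max_period, one per positional-embedding feature.
    The corresponding angular frequency for a period p is 2 * pi / p."""
    if num_feats == 1:
        return [float(min_period)]
    step = (max_period - min_period) / (num_feats - 1)
    return [min_period + i * step for i in range(num_feats)]
```

With many incommensurate periods on a linear grid, the joint cycle of the embedding (the least common multiple of the periods) grows far beyond any single period, which is what alleviates visible repetition.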
Another benefit of using our positional embeddings over LSTM is that they are “always stable”, i.e. they are always in a suitable range.
For the practical implementation, see the provided source code: https://github.com/universome/stylegan-v.
In this section, we describe the difficulties of a fair comparison in terms of the FVD score. There are discrepancies between papers even in computing FID [FID_evaluation]. So, it is not surprising that the computation of FVD for videos diverges even more and has even greater implications for method evaluation.
First, we note that the I3D model [i3d] on tf.hub (https://tfhub.dev/deepmind/i3d-kinetics-400/1), which is the model used in the official FVD repo (https://github.com/google-research/google-research/blob/master/frechet_video_distance), has different weights compared to its official release in the official github repo (https://github.com/deepmind/kinetics-i3d). That's why we manually exported the weights from tf.hub and used https://github.com/hassony2/kinetics_i3d_pytorch to obtain an exact implementation in PyTorch.
There are several issues with the FVD metric on its own. First, it does not capture motion collapse, which can be observed by comparing FVD and FVD scores between StyleGAN-V and StyleGAN-V with LSTM motion codes instead of ours: the latter has a severe motion collapse issue (see the samples on our website), yet has similar or lower FVD scores compared to our model: 196.1 or 165.8 (depending on the distance between anchors) vs 197.0 for our model. Another issue with the FVD calculation is that it is biased towards image quality. If one trains a good image generator, i.e. a model which is not able to generate any real motion at all, then FVD will still be good for it despite the degenerate motion.
We also want to make a note on how we compute FID for video generators. For this, we generate 2048 videos of 16 frames each (starting with ) and use all those frames in the FID computation. This gives ≈33k images to construct the dataset, but those images have lower diversity compared to the typically utilized 50k-sized set of images from a traditional image generator [StyleGAN]. The reason is that the 16 images in a single clip likely share a lot of content. A better strategy would be to generate 50k videos and pick a random frame from each video, but this is too heavy computationally for models which produce frames autoregressively. And using just the first frame in the FID computation would unfairly favour MoCoGAN-HD, which generates the very first frame of each video with a frozen StyleGAN2 model.
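The two frame-selection strategies discussed above can be sketched as follows (a toy illustration with our own function name):

```python
import random

def frames_for_fid(videos, strategy="all", seed=0):
    """Collect frames for FID from generated clips.
    - "all": pool every frame of every clip (the protocol described
      above: 2048 clips x 16 frames gives ~33k images, but frames
      within a clip share content, lowering diversity);
    - "random": one random frame per clip (more diverse, but requires
      generating far more videos to reach a 50k-image set)."""
    if strategy == "all":
        return [frame for clip in videos for frame in clip]
    rng = random.Random(seed)  # fixed seed only for this sketch
    return [clip[rng.randrange(len(clip))] for clip in videos]
```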
FVD is greatly influenced by 1) how many clips per video are selected; 2) with which offsets; and 3) at which frame rate. For example, SkyTimelapse contains several extremely long videos: if we select as many clips as possible from each real video, then they will severely bias the statistics of FVD. For FaceForensics, videos often contain intro frames during their first 0.5-1.0 seconds, which will affect FVD when a constant offset of is chosen to extract a single clip per video.
That’s why we use the following protocol to compute FVD.
Computing real statistics. To compute real statistics, we select a single clip per video, chosen at a random offset. We use the actual frame rate of the dataset on which the model is being trained, without skipping any frames. The problem with such an approach is that, for datasets with a small number of long videos (like FaceForensics, see Table 7), the computed statistics vary between runs; however, the standard deviation remained small even for FaceForensics. The largest standard deviation we observed was when computing FVD on RainbowJelly: on this dataset it was for VideoGPT, but it is of its overall magnitude.
Computing fake statistics. To compute fake statistics, we generate 2048 videos and save them as frames in JPEG format via the Pillow library. We use the quality parameter for doing this, since it was shown to have very close quality to PNG, but without introducing artifacts that would lead to discrepancies [FID_evaluation]. Ideally, one would like to store frames in the PNG format, but in this case it would be too expensive to represent video datasets: for example, MEAD would occupy terabytes of space in this case.
When resized to                     38.92    59.86
With jpg/png discrepancy            80.17    71.40
When using all clips per video      84.59    72.03
When using only first frames        91.64    59.74
When using subsampling of           82.88    90.21
Still real images                  342.5    166.8
We illustrate the subtleties of FVD computation in Table 4. For this, we compute real/fake statistics for our model in several different ways:
Resized to . Both fake and real images are resized to resolution via PyTorch's bilinear interpolation (without corner alignment) before computing FVD.
JPG/PNG discrepancy. Instead of saving fake frames in JPG with , we use parameter in the PIL library. This creates more JPEG-like artifacts, which, for example, FID is very sensitive to.
Using all clips per video. We use all available -frames-long clips in each video, without overlaps. Note that our model was trained
Using only first frames. In each real video, instead of using random offsets to select clips, we use the first frames.
Using subsampling. When sampling frames for computing real/fake statistics, we select every -th frame. This strategy was employed for some of the experiments in the original paper [FVD], but in their case the authors trained the model on videos with this subsampling.
For completeness, we also provide the Inception Score [TGAN] on the UCF-101 dataset in Table 5. Note that the score is computed by resizing all videos to spatial resolution (due to the internal structure of the C3D [c3d] model), which makes it impossible to capture the high-resolution details of the generated videos, which are the focus of the current work.
Method | Inception Score [TGAN]
In Tab 6, we provide the numbers used in Fig 3. Note that StyleGAN2 training in our case is slightly slower than the officially specified one (7.3 vs 7.7 GPU-days, see https://github.com/NVlabs/stylegan2-ada-pytorch), which we attribute to a slightly slower file system on our computational cluster.
In this section, we provide a list of ideas which we tried, but which didn't work, either because the idea itself is not good or because we didn't put enough experimental effort into investigating it.
Hierarchical motion codes. We tried having several layers of motion codes, where each layer has its own distance between codes. In this way, high-level codes should capture high-level motion, and bottom-level codes should represent short local motion patterns. This didn't improve the scores and didn't provide any disentanglement of motion information. We believe that motion should be represented differently (similar to FOMM [FOMM]) rather than with motion codes, because motion codes make it difficult for to keep them temporally coherent.
Maximizing entropy of motion codes to alleviate motion collapse. As an additional tool to alleviate motion collapse, we tried to maximize entropy of wave parameters of our motion codes. The generator solved the task of maximizing the entropy well, but it didn’t affect the motion collapse: it managed to save some coordination dimensions of specifically to synchronize motions.
Progressive growing of frequencies in positional embeddings. We tried starting with low frequencies first and progressively opening up new ones during training. It is a popular strategy for training implicit neural representations on reconstruction tasks (e.g., [Nerfies, SAPE]), but in our case we found the following problem with it: the generator learned to use low frequencies for representing high-frequency motion and didn't learn to utilize high frequencies for this task when they became available. That's why high-frequency motion patterns (like blinking or speaking) were unnaturally slow.
Continuous LSTM with EMA states. Our motion codes use sine/cosine activations, which makes them suffer from periodic artifacts (those artifacts are mitigated by our parametrization, but are still present sometimes). We tried to use an LSTM, but with an exponential moving average on top of its hidden states to smooth out the motion representations temporally. However (likely due to the lack of experimental effort we invested into this direction), the resulting motion representations were either too smooth or too sharp (depending on the EMA window size), which resulted in unnatural motions.
Concatenating spatial coordinates. INR-GAN [INR-GAN] uses spatial positional embeddings and shows that they provide a better geometric prior to the model. We tried to use them as well in our experiments, but they didn't provide any improvement, either qualitatively or quantitatively, and made the training slightly slower (by 10%) due to the increased channel dimensionalities.
Feature differences in . Another experimental direction we tried is computing differences between the activations of next/previous frames in a video and concatenating this information back to the activations tensor. The intuition was to provide with some sort of “latent” optical flow information. However, it made too powerful (its loss became smaller than usual) and it started to outpace too much, which decreased the final scores.
Predicting instead of conditioning in . There are two ways to utilize the time information in : as a conditioning signal or as a learning signal. For the latter, we tried to predict the time distances between frames by training an additional head to predict the class (we treated the problem as classification instead of regression, since there is a very limited number of time distances between frames which sees during its training). However, it noticeably decreased the scores.
Conditioning on video length. For unconditional UCF-101, it might be very important for to know the video length in advance. Because some classes might contain very short clips (like, jumping), while others are very long, and it might be useful for to know in advance which video it will need to generate (since we sample frames at random time locations during training). However, utilizing this conditioning didn’t influence the scores.
For our RainbowJelly benchmark, we used the following film: https://www.youtube.com/watch?v=P8Bit37hlsQ. It is an 8-hour-long movie of jellyfish in 4K resolution at 30 FPS from the Hoccori Japan YouTube channel. We cannot release this dataset due to copyright restrictions, but we released a full script which processes it (see the provided source code). To construct the benchmark, we sliced the film into 1686 chunks of 512 frames each, starting from the 150-th frame (to remove the loading screen), center-cropped and resized into resolution. This benchmark is advantageous compared to the existing ones in the following ways:
It contains complex hierarchical motions:
a jellyfish flowing in a particular direction (low-frequency global motion);
a jellyfish pushing water with its arms (medium-frequency motion)
small perturbations of jellyfish’s body and tentacles (high-frequency local motion).
It is a very high-quality dataset (4K resolution).
It is simple in terms of content, which makes the benchmark more focused on motions.
It contains long videos.
In this section, we elaborate on our simple theoretical exposition from Sec 3.3.
Consider that we want to fit a probabilistic model to the real data distribution . For simplicity, we will be considering a discrete finite case, i.e. , but note that videos, while continuous and infinite in theory, are still discretized and have a time limit to fit on a computer in practice. For fitting the distribution, we use -sparse training, i.e. picking only random coordinates from each sample during the optimization process. In other words, introducing -sparse sampling reformulates the problem from
where is a problem-specific distance function between probability distributions, is a collection of all possible sets of unique indices, and denotes a sub-vector of . This means that instead of bridging the full distributions together, we choose to bridge all their possible marginals of length instead. When will solving Eq. (8) help us obtain the full joint distribution ? To investigate this question, we develop the following simple statement.
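Since the symbols are omitted above, the reformulation just described can be sketched in LaTeX with assumed notation ($q$ for the fitted model, $D$ the distance, $\mathcal{C}_k$ the collection of all $k$-element index sets, $x_S$ a sub-vector of $x$); this is a hedged sketch, not the paper's exact Eq. (8):

```latex
\min_{q}\, D\big(p(x),\, q(x)\big)
\quad\longrightarrow\quad
\min_{q}\, \sum_{S \in \mathcal{C}_k} D\big(p(x_S),\, q(x_S)\big)
```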
Let’s denote by a collection of sets of up to indices s.t. we have for all .
Using the chain rule, we can represent as:
where denotes the sequence . Now, if we know that for each , there exists with s.t.:
then is obviously simplified to:
Does this tell us anything useful? Surprisingly, yes. It says that if is simple enough that, instead of using the whole history to model
, it's enough to use only some set of “representative moments” (unique for each ) of size , then -sparse training is a viable alternative. After fitting via -sparse training, we will be able to obtain using Eq (10), even though ! Note that one can obtain a conditional distribution from the marginal one for some set of indices via:
But we would also like to have the “reverse” dependency, i.e. knowing that if we can approximate the distribution via a set of marginals, then this distribution is not too difficult. For this claim, we will need to consider marginals not of an arbitrary form , but of the form , and we would need exactly of those. The reverse implication is the following. If can be represented as a product of conditionals , then for each there exists s.t. .
This statement, just like the previous one, looks obvious but, oddly, requires more than a single sentence to prove. First, we are given that:
but, unfortunately, we cannot directly claim that each term in the product equals its corresponding one in the product . For this, we first need to show that for each we have:
This can be seen from the fact that:
This allows us to cancel the terms in the chain rule one by one, starting from the end, leading to the desired equality:
Does this reverse claim tell us anything useful? Surprisingly again, yes. It implies that if we managed to fit by using -sparse training, then this distribution is not sophisticated.
Merging the above two statements together, we see that can be represented as a product of conditionals for if and only if for all there exists s.t. .
What does this statement tell us for video synthesis? Any video synthesis algorithm utilizes -sparse training to learn its underlying model, but in contrast to prior work, we use very small values of . This means that we fit our model to any -marginals of (considering that we pick frames uniformly at random) instead of the full distribution . Using the above statement, such a setup implies the assumption of Eq (10). This equation says that one can know everything about by just observing the previous frames . In other words, must be predictable from . Moreover, it is easy to show that our statement can generalize to include several for the -th frame, i.e. there might exist several explanatory sets of frames.
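With assumed notation (hedged again, since the paper's symbols are elided here): $x = (x_1, \dots, x_T)$ a video and $S_t \subseteq \{1, \dots, t-1\}$ with $|S_t| \le k - 1$ the explanatory set for frame $t$, the combined (if and only if) statement can be sketched as:

```latex
p(x) = \prod_{t=1}^{T} p\big(x_t \mid x_{<t}\big)
     = \prod_{t=1}^{T} p\big(x_t \mid x_{S_t}\big)
\quad\Longleftrightarrow\quad
\forall t\;\; \exists S_t :\; p\big(x_t \mid x_{<t}\big) = p\big(x_t \mid x_{S_t}\big)
```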
For the ease of visualization, we provide additional samples of the model via a web page: https://universome.github.io/stylegan-v.
Our model shares a lot of similarities to DIGAN [DIGAN] and in this section we highlight those similarities and differences.
Sparse training. DIGAN also utilizes very sparse training (only 2 frames per video). But in our case, we additionally explore the optimal number of frames per video (see Sec 3.3).
Continuous-time generator. DIGAN also builds a generator which is continuous in time, but our generator does not lose quality at infinitely large lengths.
Dropping conv3d blocks. DIGAN also drops conv3d blocks in their discriminator. But in contrast to us, they still have 2 discriminators.
Motion representation. DIGAN uses only a single global motion code, which makes it theoretically impossible to generate infinite videos: at some point it will start repeating itself (due to the usage of sine/cosine-based positional embeddings). In our case, we use an infinite sequence of motion codes, which are temporally interpolated, used to compute wave parameters, and transformed into motion codes. DIGAN mixes temporal and spatial information together in the same positional embedding, which creates the following problem: even when only time changes, the spatial location perceived by the model also changes. This creates a “head-flying-away” effect (see the samples). In our case, we keep these two information sources decomposed from one another.
Generator’s backbone. DIGAN is built on top of INR-GAN [INR-GAN], while our work uses StyleGAN2. This allows DIGAN to inherit INR-GAN’s benefits from being spatially continuous, but at the expense of being less stable and being slower to train (due to the lack of mixed precision and increased channel dimensionalities from concatenating positional embeddings).
Discriminator structure. DIGAN uses two discriminators: the first one operates on image-level and is equivalent to StyleGAN2’s one, while the other one operates on “video” level and takes frames and the time differences between them , concatenates them all together into a 7-channel input image (tiling the time difference scalar) and passes into a model with StyleGAN2 discriminator’s backbone. In our case, we utilize a single hypernetwork-based discriminator.
Sampling procedure. We use samples per video, while DIGAN uses . Also, we sample frames uniformly at random, while DIGAN selects and (in this way, DIGAN sometimes has ). Apart from that, they use .
INR-GAN demonstrated that it has higher throughput than StyleGAN2 in terms of images/second [INR-GAN]. But the authors compare to the original StyleGAN2 implementation and not to the one from the StyleGAN2-ADA repo, which is much better optimized. Also, they use caching of positional embeddings, which is only possible at test time and has a great influence on computational performance. In this way, we found that StyleGAN2 is times faster to train and consumes less GPU memory than INR-GAN.
DIGAN is based on top of INR-GAN and that's why it suffers from the issues described above. We trained it for a week on v100 NVidia GPUs and observed that it stopped improving after days of training. This is equivalent to real frames seen by the discriminator (while MoCoGAN+SG2 and StyleGAN-V reach in just 2 days at the same resolution in the same environment). At the time of submitting the main paper, there was no information about the training cost. However, the authors updated their manuscript by the time of the supplementary submission and specify a training cost of 8 GPU-days at resolution, which is consistent with our experiments (considering that we have twice as large a resolution).