Towards Accurate Generative Models of Video: A New Metric & Challenges

12/03/2018 ∙ by Thomas Unterthiner, et al. ∙ Google ∙ IDSIA ∙ Johannes Kepler University Linz

Recent advances in deep generative models have led to remarkable progress in synthesizing high quality images. Following their successful application in image processing and representation learning, an important next step is to consider videos. Learning generative models of video is a much harder task, requiring a model to capture the temporal dynamics of a scene, in addition to the visual presentation of objects. Although recent attempts at formulating generative models of video have had some success, current progress is hampered by (1) the lack of qualitative metrics that consider visual quality, temporal coherence, and diversity of samples, and (2) the wide gap between purely synthetic video datasets and challenging real-world datasets in terms of complexity. To this end we propose Fréchet Video Distance (FVD), a new metric for generative models of video based on FID, and StarCraft 2 Videos (SCV), a collection of progressively harder datasets that challenge the capabilities of the current iteration of generative models for video. We conduct a large-scale human study, which confirms that FVD correlates well with qualitative human judgment of generated videos, and provide initial benchmark results on SCV.




1 Introduction

Recent advances in deep generative models have led to remarkable success in synthesizing high-quality images [20, 3]. Their versatility offers a unified approach to a variety of image processing and computer vision tasks. For example, Generative Adversarial Networks (GANs) may be used to perform image super-resolution, image-to-image translation [17], and semantic segmentation [27]. Similarly, other generative approaches are able to extract useful representations of an image via inference, and have shown promising results in semi-supervised learning [32] and few-shot learning [23].

An important next challenge is to learn generative models of videos. In order to accurately synthesize videos, a model must capture the temporal dynamics of a visual scene, e.g. how objects interact, in addition to their visual presentation. In doing so, generative models of video are expected to facilitate a wide range of applications, including missing-frame prediction [18], improved instance segmentation [13], or complex (relational) reasoning tasks by conducting inference [25]. Historically, many different approaches to video generation have been explored. Traditional approaches train a recurrent neural network (e.g. an LSTM [15]) to perform next-frame prediction, requiring temporal dependencies to be captured by the recurrent connections that can be iterated to obtain future frames from a particular state [33, 38, 29, 10, 26]. However, the resulting output sequence is deterministic given a set of context frames, and fails to correctly account for the many possible futures that a sequence of context frames may have in reality (e.g. due to external, unobserved factors). More recent approaches additionally condition on a set of external random variables (latent factors) [1, 7, 24], generate a sequence of observations entirely from noise, e.g. using GANs [45, 34, 41], or factorize the joint distribution over the whole video, such that each pixel in the sequence is conditioned on all previously generated pixels [19, 4].

While a lot of progress has been made in recent years, video generation models are still in their infancy, and generally unable to synthesize more than a few seconds of video before breaking down [1]. Indeed, by looking at generated samples it is evident that learning a good dynamics model remains a major challenge in generating real world videos. However, in order to qualitatively measure progress in accurately synthesizing videos we require corresponding metrics that consider visual quality, temporal coherence, and diversity of generated samples. Likewise, in order to isolate (independent) factors that contribute to different failure modes, we require corresponding data sets that test for specific capabilities. In this work we provide a new metric and a suite of data sets to help speed up progress in research towards accurate generative models of videos.

Our first contribution is Fréchet Video Distance (FVD), a new metric for generative models of video (code to compute FVD is available online). FVD builds on the principles underlying Fréchet Inception Distance (FID; [14]), which has been successfully applied to images, and was later adapted to other fields [31]. We introduce a different feature representation that captures the temporal coherence of the content of a video, in addition to the quality of each frame. Unlike popular metrics such as Peak Signal to Noise Ratio (PSNR) or the Structural Similarity (SSIM; [46]) index, FVD considers a distribution over videos, thereby avoiding the drawbacks of frame-level metrics [16]. We conduct multiple experiments to evaluate FVD. By adding noise to real videos we reveal that FVD is sensitive to both temporal and frame-level perturbations. A large-scale human study confirms that FVD agrees well with qualitative human judgment of generated videos.

Our second contribution is StarCraft 2 Videos (SCV): a suite of challenging data sets that require relational reasoning and long-term memory (the SCV data sets are available online). Using the open-source StarCraft 2 Learning Environment (SC2LE; [44]), we contribute four data sets consisting of game play from different StarCraft 2 scenarios. Each data set is scalable along many axes, including the resolution and complexity of the scenes. We provide a comprehensive comparison of current state of the art models on SCV in terms of FVD.

2 Fréchet Video Distance

| Scenario | Stochastic Elements | Main Challenge |
|---|---|---|
| MUtB | unit type, unit color, moving direction | ‘Unit test’, reproducing movement animations |
| CMS | placement of crystals, path through crystals | Learning complex action sequences and modelling movement paths |
| Brawl | number of units, unit types, unit positions | Modelling many moving entities, and their interactions |
| RTwM | number of units, unit types, target beacons | Remembering events over many time steps |

Table 1: An overview of the properties of the SCV scenarios.

An accurate generative model of videos captures the underlying data distribution with which the observed data was generated. Hence, it is natural to consider the distance between the real world data distribution $P_R$ and the distribution defined by the generative model $P_G$ as an evaluation metric. In practice, no analytic expression of either distribution is available, which rules out straightforward application of many common distance functions. Consider for example the popular Fréchet Distance (or 2-Wasserstein distance) between $P_R$ and $P_G$ defined by:

$$d(P_R, P_G) = \min_{X, Y} \mathbb{E}\,|X - Y|^2 \qquad (1)$$

where the minimization is over all random variables $X$ and $Y$ with distributions $P_R$ and $P_G$ respectively. This expression is very difficult to solve for the general case, although it has a closed form solution when $P_R$ and $P_G$ belong to any family of distributions that is closed with respect to linear transformations of the random vectors [8]. For example, when $P_R$ and $P_G$ are multivariate Gaussian the right-hand side in Eq. 1 can be reduced to:

$$d(P_R, P_G) = |\mu_R - \mu_G|^2 + \mathrm{Tr}\left(\Sigma_R + \Sigma_G - 2\,(\Sigma_R \Sigma_G)^{1/2}\right) \qquad (2)$$

where $\mu_R$ and $\mu_G$ are the means and $\Sigma_R$ and $\Sigma_G$ are the covariance matrices of $P_R$ and $P_G$ respectively.

It can be seen that evaluating Eq. 1 becomes feasible if we assume a particular form of the distributions under consideration. Often, as is usually the case for videos, a multivariate Gaussian is not an accurate approximation of the underlying data distribution. However, by taking this approximation in a different feature space, we may be able to obtain a tighter fit. Heusel et al. [14] proposed to use a learned feature embedding when the data distribution $P_R$ and the model distribution $P_G$ are distributions over real world images. First, an Inception network [39] is trained on ImageNet [6] to classify images. Next, samples from $P_R$ and $P_G$ are fed through the pre-trained network and their feature representation (activations) in one of the hidden layers is recorded. Finally, the Fréchet Inception Distance (FID; [14]) is obtained by computing Eq. 2 using the means and covariances obtained by fitting a multivariate Gaussian distribution to the recorded responses of the real and generated samples.
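The final step of this recipe, fitting a Gaussian to each set of recorded activations and evaluating Eq. 2, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' released code: the function names `gaussian_stats` and `frechet_distance` are hypothetical, the feature arrays are assumed to have shape (num_samples, feature_dim), and the trace of the matrix square root is obtained from the eigenvalues of the covariance product, which is only one of several ways to evaluate that term.

```python
import numpy as np


def gaussian_stats(features):
    """Fit a multivariate Gaussian: return the mean vector and covariance matrix.

    `features` is assumed to be an array of shape (num_samples, feature_dim).
    """
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma


def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between two Gaussians (the closed form referred to as Eq. 2)."""
    diff = mu_r - mu_g
    # The product of two positive semi-definite matrices has real, non-negative
    # eigenvalues, so the trace of its square root is the sum of their square roots.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random vectors standing in for network activations of real / generated samples.
    real = rng.normal(size=(500, 16))
    fake = rng.normal(loc=0.5, size=(500, 16))
    print(frechet_distance(*gaussian_stats(real), *gaussian_stats(fake)))
```

In practice the two feature arrays would hold activations of a pre-trained network for real and generated samples; random vectors are used here only so that the sketch runs on its own.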

The feature representation learned by the pre-trained neural network greatly affects the quality of the metric. When training on ImageNet, features are learned that expose information useful in reasoning about objects in images, whereas other information content may be suppressed. Likewise, different layers of the network encode features at different abstraction levels. In order to obtain a suitable feature representation for videos we require a pre-trained network to consider the temporal coherence of the visual content across a sequence of frames, in addition to its visual presentation at any given point in time. In this work we investigate several variations of a pre-trained Inflated 3D Convnet (I3D) [5], and name the resulting metric the Fréchet Video Distance (FVD). The I3D network generalizes the Inception architecture to sequential data, and is trained to perform action-recognition on the Kinetics data set consisting of human-centered YouTube videos [21]. Action-recognition can be viewed as a temporal extension of image classification, requiring visual context and temporal evolution to be considered simultaneously. I3D has been shown to excel at this task, achieving state-of-the-art results on the UCF101 [37] and HMDB51 [22] data sets, and placing first at the CVPR 2017 Charades challenge [21]. In this work we consider I3D networks that have been trained on the RGB frames of the Kinetics-400 data set, and on the Kinetics-600 data set (we used the pre-trained weights that are available online). In order to obtain suitable feature representations we consider the logits in the final layer, as well as the output of the final pooling layer, analogous to prior work.


Figure 2: Video frames of each SCV scenario sampled at regular intervals. From top to bottom: Move Unit to Border (MUtB), Collect Mineral Shards (CMS), Brawl, and Road Trip with Medivac (RTwM).

A potential downside in using Eq. 2 is the potentially large error in estimating Gaussian distributions over the learned feature space. Alternatively, Bińkowski et al. [2] proposed to use the Maximum Mean Discrepancy (MMD [12]) in the case of images, and we will explore this variation in the context of videos as well. MMD is a kernel-based approach, which provides a means to calculate the distance between two empirical distributions without assuming a particular form. Concretely, if $\{x_i\}_{i=1}^{m}$ and $\{y_j\}_{j=1}^{n}$ are samples drawn from two random variables $X$ and $Y$ with distributions $P_R$ and $P_G$, then an unbiased estimator of the squared MMD distance between these two random variables is given as:

$$\mathrm{MMD}^2(X, Y) = \frac{1}{m(m-1)} \sum_{i \neq j}^{m} k(x_i, x_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j) + \frac{1}{n(n-1)} \sum_{i \neq j}^{n} k(y_i, y_j)$$

where $k$ is a kernel that measures the similarity between two input vectors. Bińkowski et al. [2] proposed to use a polynomial kernel, which we will apply to the learned features of the I3D network to obtain the Kernel Video Distance (KVD).
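As an illustration of the kernel-based alternative, the sketch below computes the standard unbiased estimator of the squared MMD from two sets of feature vectors. The kernel form is an assumption: the exact polynomial kernel is not legible in this text, so a cubic kernel with the inner product scaled by the feature dimension is used as a stand-in, and the function names `polynomial_kernel` and `mmd2_unbiased` are hypothetical.

```python
import numpy as np


def polynomial_kernel(a, b, degree=3, coef0=1.0):
    """Polynomial kernel matrix between the rows of `a` and `b`.

    Assumed form: (a.b / d + coef0) ** degree, with d the feature dimension.
    """
    d = a.shape[1]
    return (a @ b.T / d + coef0) ** degree


def mmd2_unbiased(x, y, kernel=polynomial_kernel):
    """Unbiased estimate of the squared MMD between samples x (m, d) and y (n, d)."""
    m, n = len(x), len(y)
    k_xx, k_yy, k_xy = kernel(x, x), kernel(y, y), kernel(x, y)
    # Within-sample sums exclude the diagonal (i == j) terms.
    sum_xx = k_xx.sum() - np.trace(k_xx)
    sum_yy = k_yy.sum() - np.trace(k_yy)
    return float(sum_xx / (m * (m - 1)) + sum_yy / (n * (n - 1)) - 2.0 * k_xy.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    same = mmd2_unbiased(rng.normal(size=(200, 8)), rng.normal(size=(200, 8)))
    shifted = mmd2_unbiased(rng.normal(size=(200, 8)), rng.normal(loc=1.0, size=(200, 8)))
    print(same, shifted)
```

Because the estimator is unbiased, it can come out slightly negative when the two samples are drawn from the same distribution.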

In our experiments in Section 4, we will compare FVD (KVD) to the two main metrics that are currently used in the relevant literature: PSNR, and SSIM. The Peak Signal to Noise Ratio (PSNR) relates the maximum attainable pixel value of the pixels in an image to its Mean Squared Error (MSE) with respect to a ground-truth image. In the case of videos, the PSNR is computed for each frame with respect to a reference frame. Several studies have pointed out that PSNR does not correlate well with subjective video quality, in particular when simultaneously evaluating multiple videos with different content [16]. The Structural Similarity (SSIM; [46]) index measures the quality of an image as the perceived change in structural information. Similar to the PSNR, the SSIM metric requires access to ground-truth frames, and considers each frame individually. It suffers from the same drawbacks in comparing distributions over videos (with potentially different content) according to human judgment [30, 47].
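For reference, the frame-level PSNR described above can be written as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below is a minimal illustration under stated assumptions: frames are arrays with pixel values in [0, 255], the per-video score is taken as the plain average over frames, and the function names `psnr` and `video_psnr` are hypothetical.

```python
import numpy as np


def psnr(reference, generated, max_value=255.0):
    """Peak Signal to Noise Ratio (in dB) of one frame against its ground truth."""
    diff = reference.astype(np.float64) - generated.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))


def video_psnr(reference_video, generated_video, max_value=255.0):
    """Average the per-frame PSNR over a video of shape (time, height, width, channels)."""
    scores = [psnr(r, g, max_value) for r, g in zip(reference_video, generated_video)]
    return float(np.mean(scores))
```

Note that both functions need a ground-truth frame for every generated frame, which is the limitation discussed in the text.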

An advantage of SSIM and PSNR is that they can be used to measure the degree to which a particular generated sequence deviates from a ground-truth sequence, i.e. given a particular context. FVD necessarily considers only distributions of videos, although we will show in Section 4 (Switching noise) that in combining the context frames and the generated frames it can account for generated frames that do not necessarily progress from a sequence of context frames. On the other hand, FVD is naturally applied to the unconditional case in which no ground-truth sequences are available, and estimates the variety among the generated sequences in addition to their visual plausibility.

3 StarCraft 2 Videos

In order to advance the capabilities of generative models for video, a number of challenges must be overcome. For example, generative models must relate similar actions that are performed by agents differing in their visual appearance (e.g. KTH [36]), or account for the highly non-linear temporal dynamics in observing a robot arm that pushes objects around on a table [10]. Here we propose to consider these challenges in a simpler setting (the StarCraft 2 Learning Environment [44]) to serve as an intermediate step towards real world video data sets.

We introduce the StarCraft 2 Videos (SCV), a suite of benchmark data sets of increasing complexity for video generation. SCV offers tasks that are meant to serve as ‘unit tests’ while developing new models, and more challenging tasks that are meant to bring current models to their limits. The latter isolates challenges in real-world video generation such as long-term memory, and relational reasoning, to highlight typical failure modes and avenues for future research. The SCV data sets can be generated at different resolutions and easily adapted to adjust the complexity of each scene. We achieve this by rendering scenes from the video game StarCraft 2 (SC2), using the open-source StarCraft 2 Learning Environment (SC2LE;  [44]). SCV includes four different scenarios, for which we can generate a virtually unlimited number of videos. Example frames from each scenario can be seen in Figure 2.

Move Unit to Border (MUtB)

A randomly chosen game unit moves from the middle of the map to its border. This introductory test scenario offers similar complexity to other introductory test scenarios, such as “Moving MNIST” [38] or “Stochastic Shapes” [1]. However, it includes various stochastic dimensions: the type of game unit, its color, and destination along the border. In addition it requires a model to learn the whole animation sequence of a moving unit, and each unit type’s moving speed.

Collect Mineral Shards (CMS)

Two units move across the screen to collect randomly placed mineral shards in a greedy fashion (this map, and the corresponding agent, were originally developed as part of the SC2LE framework [44]). This scenario requires a video model to learn about the relation between a unit reaching a shard and the shard disappearing, and about the shortest-path pattern in which units walk from location to location.


Brawl

Two large armies face each other and each attempts to destroy the other. Each army is composed of a number of randomly chosen units that are spawned at random locations within a starting region. The “Terran” (left) army is ordered to attack, repeatedly targeting the nearest enemy unit, such that the winner is solely determined by the army composition and the initial location of units. An accurate generative model must learn about properties of units and their interaction, including unit types, health points, attack range, and animations.

Road Trip with Medivac (RTwM)

A flying transportation unit (“medivac”) is tasked to pick up a small group of units (one by one) that spawn at nearby locations. Next, it is tasked to visit a number of beacons, before eventually unloading the exact same units at the final beacon. In order to accurately model this scenario, a generative model must account for the long term dependencies between loading and offloading the same units.


An overview of the stochastic elements and main objectives of each scenario is listed in Table 1. For each scenario, we provide a data set consisting of 10 000 training, 2 000 validation and 2 000 test set videos in two different resolutions. Sequence lengths differ between scenarios: videos in MUtB are between 11 and 27 frames long, in CMS between 60 and 99 frames long, in Brawl 99 frames long, and in RTwM 32 frames long. Actions are provided in the form of visual cues (right-mouse clicks), which guide a model in predicting the next frame: e.g. in MUtB, the first frame indicates the direction the unit will move to, and in CMS, it shows which crystal a unit will target next. Additional details are available in Appendix B, and the data sets are available online.

4 Experiments

4.1 Noise Study

We test whether FVD is sensitive to a number of basic distortions by adding various types of noise to real videos. We consider static noise added to individual frames, and temporal noise, which distorts the entire sequence of frames. While common image-based metrics are capable of detecting static noise, they were not designed to detect temporal noise.

To test whether FVD can detect static noise we added one of the following distortions to each frame in a sequence of video frames: (1) a black rectangle drawn at a random location in the frame, (2) Gaussian blur, which applies a Gaussian smoothing kernel to the frame, (3) Gaussian noise, which interpolates between the observed frame and standard Gaussian noise, and (4) Salt & Pepper noise, which sets each pixel in the frame to either black or white with a fixed probability. Temporal noise was injected by (1) locally swapping a number of randomly chosen frames with their neighbors in the sequence, (2) globally swapping a number of randomly chosen pairs of frames selected across the whole sequence, (3) interleaving the sequences of frames corresponding to multiple different videos to obtain new videos, and (4) switching from one video to another video after a number of frames to obtain new videos. We applied these distortions at up to six different intensities that are unique to each type, e.g. related to the size of the black rectangle, the number of swaps to perform, or the number of videos to interleave. Additional details are available in Section A.
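A few of these distortions can be sketched as follows. This is an illustrative sketch and not the authors' implementation: videos are assumed to be `uint8` arrays of shape (time, height, width, channels), the function names are hypothetical, and the exact sampling procedure (for example, whether swap positions may overlap) is an assumption.

```python
import numpy as np


def local_swap(video, num_swaps, rng):
    """Temporal noise: swap randomly chosen frames with their right-hand neighbor."""
    out = video.copy()
    positions = rng.choice(len(out) - 1, size=num_swaps, replace=False)
    for i in positions:
        out[[i, i + 1]] = out[[i + 1, i]]
    return out


def global_swap(video, num_swaps, rng):
    """Temporal noise: swap randomly chosen pairs of frames anywhere in the sequence."""
    out = video.copy()
    for _ in range(num_swaps):
        i, j = rng.choice(len(out), size=2, replace=False)
        out[[i, j]] = out[[j, i]]
    return out


def switch_videos(first, second, switch_at):
    """Temporal noise: play `first` up to frame `switch_at`, then continue with `second`."""
    return np.concatenate([first[:switch_at], second[switch_at:]], axis=0)


def salt_and_pepper(video, prob, rng, max_value=255):
    """Static noise: set each pixel to black or white with probability `prob`."""
    out = video.copy()
    mask = rng.random(video.shape[:3]) < prob
    values = rng.choice(np.array([0, max_value], dtype=video.dtype), size=video.shape[:3])
    out[mask] = values[mask][:, None]  # same value for all channels of a pixel
    return out
```

Raising `num_swaps` or `prob` plays the role of the noise intensity in the experiment described above.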

Figure 3: Behaviour of FVD when adding various types of noise to different data sets, using the logits activations of the I3D model trained on Kinetics-400 as embedding.

We computed the FVD and KVD between videos from the BAIR [9], Kinetics-400 [5] and HMDB51 [22] data sets and their noisy counterparts. As potential embeddings, we considered the top-most pooling layer, and the logits layer of the I3D model pre-trained on the Kinetics-400 data set, as well as the same layers in a variant of the I3D model pre-trained on the extended Kinetics-600 data set. As a baseline, we compared to a naive extension of FID for videos in which the Inception network (pre-trained on ImageNet [6]) is evaluated for each frame individually, and the resulting embeddings (or their pair-wise differences) are averaged to obtain a single embedding for each video. This “FID” score is then computed according to Eq. 2.

We observed that all variants were able to detect the various injected distortions to some degree, with the pre-trained Inception network generally being inferior at detecting temporal distortions, as was expected. In Figure 6 it can be seen that the logits layer of the I3D model pre-trained on Kinetics-400 is among the best performing configurations in terms of rank correlation with the sequence of noise intensities. Hence, in the remainder of this work we will continue to use this configuration when computing the FVD (a preliminary study in which a human ranking of generated videos was compared to the ranking obtained by each metric further corroborated this choice). An overview of its scores on the noise experiments can be seen in Figure 3.

Figure 4: FVD between two non-overlapping subsets of videos that are randomly drawn from the BAIR video pushing data set. Error bars are standard errors over 50 different tries.

4.2 Effect of Sample Size on FVD

In a second experiment we consider the accuracy with which FVD is able to calculate the true underlying distance between a distribution of generated videos and a target distribution. To calculate the FVD according to Eq. 2 we need to estimate the means and covariance matrices ($\mu$ and $\Sigma$) from the available samples. The larger the sample size, the better these estimates will be, and the better FVD will reflect the true underlying distance between the distributions. For an accurate generative model these distributions will typically be fairly close, and the noise from the estimation process will primarily affect our results. This effect has been well-studied for FID [28, 2], and is depicted for FVD in Figure 4. It can be seen that even when the underlying distributions are identical, FVD will typically be larger than zero because our estimates of the parameters $\mu$ and $\Sigma$ are noisy. It can also be seen that for a fixed number of samples the standard errors (measured over 50 tries) are small, and an accurate comparison can be made. Hence, it is critical that in comparing FVD values across models, one considers the same sample size (this has led to confusion regarding FID in the past, when researchers used different sample sizes in their comparisons [28, 42, 14]).
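The sample-size effect can be reproduced on synthetic data. The sketch below is an assumption-laden toy experiment, not a re-run of the paper's study: standard Gaussian vectors stand in for video embeddings, the dimension and number of tries are arbitrary, and the helper names are hypothetical. It draws two independent samples from the same distribution and reports the mean Fréchet distance and its standard error.

```python
import numpy as np


def frechet_distance_from_samples(a, b):
    """Fréchet distance between Gaussians fitted to two sample arrays of shape (n, d)."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    sigma_a, sigma_b = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    # Trace of the matrix square root via the eigenvalues of the covariance product.
    eigvals = np.linalg.eigvals(sigma_a @ sigma_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sigma_a) + np.trace(sigma_b) - 2.0 * trace_sqrt)


def same_distribution_distance(sample_size, dim=8, tries=20, seed=0):
    """Mean and standard error of the distance between two samples of one distribution."""
    rng = np.random.default_rng(seed)
    values = [
        frechet_distance_from_samples(
            rng.normal(size=(sample_size, dim)), rng.normal(size=(sample_size, dim))
        )
        for _ in range(tries)
    ]
    return float(np.mean(values)), float(np.std(values, ddof=1) / np.sqrt(tries))


if __name__ == "__main__":
    for n in (32, 128, 512, 2048):
        print(n, same_distribution_distance(n))
```

The distance stays above zero for every sample size and shrinks as the sample grows, which is why scores should only be compared at equal sample sizes.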

4.3 Human Evaluation

One important criterion for the performance of generative models is the visual fidelity of the samples as judged by human observers [40]. Hence, a metric for generative models must ultimately correlate well with human judgment. To this end we trained several conditional video generation models, and asked human raters to compare the quality of the generated videos in different scenarios.

We trained Convolutional Dynamic Neural Advection (CDNA; [10]), Stochastic Variational Video Prediction (SV2P; [1]), Stochastic Video Prediction with Fixed Prior (SVP-FP; [7]) and Stochastic Adversarial Video Prediction (SAVP; [24]) on the BAIR data set. Using a wide range of possible hyper-parameter settings, and by including model parameters at various stages of training, we obtain over 3 000 different models. Generated videos are obtained by combining 2 frames of context with the subsequent 14 output frames. Following prior work [1, 7, 24] we obtain the PSNR and SSIM scores by generating 100 videos for each input context (conditioning frames) and returning the best frame-averaged value among these videos. We consider 512 video sequences (unseen by the model) to estimate the target distribution when computing FVD.

We conduct several human studies based on different subsets of the trained models. In particular, we select models according to two different scenarios:

One Metric Equal

We consider models that are indistinguishable according to a single metric, and evaluate to what degree human raters and other competing metrics are able to distinguish these models in terms of the quality of the generated videos. We choose 10 models having roughly equal values for a given metric that are close to the best quartile of the overall distribution of that metric, i.e., the models were chosen such that they are worse than 25 % of the remaining models and better than 75 % of the remaining models as determined by the metric under consideration. We were able to choose models whose values were identical up to the first 4-5 significant digits for each metric.

One Metric Spread

In a second setting we consider to what degree the scores of models that differ widely on a given metric coincide with the subjective quality of their generated videos as judged by humans. We choose 10 models which were equidistant between the 10th and 90th percentiles of the overall distribution of that metric. In this case, there should be clear differences in terms of the quality of the generated videos among the models under consideration (provided that the metric is accurate), suggesting high agreement with human judgment for the metric under consideration in comparison to competing metrics.
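The selection of equidistant models can be sketched as below. The text does not say how a model is matched to each target value, so picking the not-yet-chosen model whose score is nearest to each target is an assumption, and `select_spread_models` is a hypothetical helper name.

```python
import numpy as np


def select_spread_models(scores, num_models=10, low=10.0, high=90.0):
    """Pick models whose scores are roughly equidistant between two percentiles.

    `scores` holds one metric value per trained model. Returns model indices.
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = np.percentile(scores, [low, high])
    targets = np.linspace(lo, hi, num_models)
    chosen = []
    for target in targets:
        # Nearest model to the target score that has not been selected yet.
        order = np.argsort(np.abs(scores - target))
        index = next(int(i) for i in order if int(i) not in chosen)
        chosen.append(index)
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fvd_scores = rng.uniform(100.0, 1500.0, size=3000)  # synthetic scores for illustration
    picks = select_spread_models(fvd_scores)
    print(np.round(fvd_scores[picks], 1))
```

The "One Metric Equal" scenario could be sketched the same way by setting all targets to a single percentile value.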


For the human evaluation, we used 3 generated videos from each selected model. Human raters would be shown a video from two models, and then asked to identify which of the two looked better, or alternatively report that their quality was indistinguishable. Each pair of compared videos was shown to up to 3 independent raters, where the third rater was only asked if the first two raters disagreed. The raters were given no prior indication about which video was thought to be better. We calculated the correspondence between these human ratings and the ratings determined by the various metrics under consideration.
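One way to turn such ratings into an agreement score is sketched below. Several details are assumptions rather than the authors' procedure: votes are encoded as `'a'`, `'b'` or `'same'`, the third rater's vote is used as the final human judgment when the first two disagree, a `'same'` vote counts as disagreeing with the metric, and the function names are hypothetical.

```python
def resolve_human_vote(first, second, third=None):
    """Combine up to three votes; the third is consulted only on disagreement."""
    if first == second:
        return first
    return third  # None if no third rating was collected


def metric_vote(score_a, score_b, lower_is_better=True):
    """Which of two videos a metric prefers ('a', 'b', or 'same' on a tie)."""
    if score_a == score_b:
        return "same"
    a_is_better = score_a < score_b if lower_is_better else score_a > score_b
    return "a" if a_is_better else "b"


def metric_agreement(human_votes, scores_a, scores_b, lower_is_better=True):
    """Fraction of rated pairs on which the metric and the human vote coincide."""
    agree, total = 0, 0
    for vote, score_a, score_b in zip(human_votes, scores_a, scores_b):
        if vote is None:
            continue  # pair without a usable human judgment
        total += 1
        agree += vote == metric_vote(score_a, score_b, lower_is_better)
    return agree / total


if __name__ == "__main__":
    votes = [resolve_human_vote("a", "a"), resolve_human_vote("a", "b", "b")]
    # FVD-like scores (lower is better) for video A and video B of each pair.
    print(metric_agreement(votes, [120.0, 300.0], [250.0, 180.0]))
```

For PSNR and SSIM, where higher values are better, `lower_is_better=False` would be passed instead.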

| Metric | eq. FVD | eq. SSIM | eq. PSNR | spr. FVD | spr. SSIM | spr. PSNR |
|---|---|---|---|---|---|---|
| FVD | N/A | 74.9 % | 81.0 % | 71.9 % | 58.4 % | 63.5 % |
| SSIM | 51.5 % | N/A | 44.6 % | 61.8 % | 51.2 % | 45.9 % |
| PSNR | 56.3 % | 21.4 % | N/A | 54.1 % | 37.0 % | 44.8 % |
| Among raters | 79.3 % | 77.8 % | 84.4 % | 83.3 % | 69.9 % | 72.5 % |

Table 2: Agreement of metrics with human judgment when considering models with a fixed value for a given metric (eq.), or with spread values over a wide range (spr.). FVD is superior at judging generated videos based on subjective quality.

The results of the human evaluation studies can be seen in Table 2. When choosing good models with comparable FVD values (eq. FVD), both SSIM and PSNR only agreed with human judgment in roughly half of the cases. In contrast, in the scenarios that consider good models with similar values as determined by SSIM (eq. SSIM) and PSNR (eq. PSNR), we find that FVD agrees well with human judgment (74.9 % and 81.0 % agreement respectively) and is clearly superior in distinguishing these models in terms of performance. If we analyze the scenario in which models are chosen at regularly spaced intervals with respect to FVD (spr. FVD), then we find that FVD agrees well with human judgment (71.9 %) as expected. More interesting are the cases in which models are chosen at regularly spaced intervals with respect to SSIM (spr. SSIM) and PSNR (spr. PSNR). Although these scenarios are clearly favorable for the metric under consideration, we find that FVD remains superior in agreeing with human judgment. It is clear that FVD is better equipped to rank models according to human perception of quality. Table 2 also reports the agreement among raters. These are computed as the fraction of the comparisons in which the first two raters agreed for a given video pair, averaged across all comparisons to obtain the final percentage. It can be seen that in most cases the raters are confident in comparing generated videos.

4.3.1 Resolution of FVD

While Section 4.2 demonstrates that FVD results are highly reproducible for a fixed sample size, it does not consider to what degree small differences in FVD can be considered meaningful. To answer this question, human raters were asked to compare videos generated by a randomly chosen model having an FVD of 200 / 400 (base200 / base400) and generated videos by models that were 10, 20, 50, 100, 200, 300, 400, and 500 FVD points worse. In each case we selected 5 models from the models available at these FVD scores and generated 3 videos for each model, resulting in a total of 1 800 comparisons. For a given video comparison, raters were asked to decide which of the two videos looked better, or if they were of similar quality. For each of these pairs, we asked up to 3 human raters for their opinion.

In Figure 5 it can be seen that when the difference in FVD is smaller than 50, the agreement with human raters is close to random (but never worse), and increases rapidly once two models are more than 50 FVD points apart. Hence, it can be concluded that differences of 50 FVD or more typically correspond to differences in the quality of the generated videos that can be perceived by humans.

Figure 5: Fraction of human raters that agree with FVD on which of two models is better, as a function of the difference in FVD between the models. Error bars are standard errors, and raters deciding that video pairs are of similar quality are counted as not agreeing with FVD.

4.4 Baseline Results on SC2 Benchmark Data Sets

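| Model | BAIR | KTH | MUtB (left) | MUtB (right) | CMS (left) | CMS (right) | Brawl (left) | Brawl (right) | RTwM (left) | RTwM (right) |
|---|---|---|---|---|---|---|---|---|---|---|
| CDNA | 296.5 | 150.8 | 486.1 | 51.4 | 440.8 | 515.3 | 877.1 | 1016.6 | 1089.3 | 1295.4 |
| SV2P | 262.5 | 136.8 | 423.9 | 710.5 | 430.4 | 316.0 | 859.9 | 995.9 | 1068.7 | 1026.1 |
| SVP-FP | 315.5 | 208.4 | 276.7 | 121.3 | 379.8 | 442.1 | 714.5 | 1240.7 | 1022.9 | 2031.4 |
| SAVP | 116.4 | 78.0 | 479.7 | 204.4 | 188.8 | 192.5 | 192.9 | 150.3 | 698.6 | 1055.4 |

Table 3: FVD scores on various data sets. For SCV data sets, the left and right columns correspond to the two resolutions.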

In order to provide baseline results on our data sets, as well as to give an indication of the range of sensible values for FVD, we provide results for CDNA, SV2P, SVP-FP and SAVP on BAIR and KTH as well as all the scenarios of our SC2 benchmark suite. Models on BAIR, MUtB, CMS and Brawl were provided 2 context frames and trained to generate 14 output frames. For KTH, we follow the literature standard and provide 10 frames of context and train models to output 10 frames. On RTwM it is essential that the start and the end of the video are part of the same sequence (i.e. when batching), meaning that we provide 2 context frames and then output up to 32 additional frames, which is the maximal length of an RTwM scenario. For evaluation, we used 500 videos when calculating FVD for BAIR and KTH, and 1 000 videos for SCV.

We used the Tensor2Tensor [43] implementation of each model and the default parameters unless otherwise stated. Suitable hyper-parameters were obtained using a grid search over the learning rate, and over the trade-off between reconstruction loss and KL divergence for VAE-based models. For SV2P, we followed the annealing schedule as outlined in [1]. For SAVP, we additionally tuned the GAN-loss and the GAN-VAE-loss by searching for suitable values on a logarithmic scale. All models were trained for a total of 300 000 update steps.

The results of this benchmark can be seen in Table 3. We find that although there are substantial differences in the quality of the generated videos (as reflected by the FVD scores), all models have similar failure modes. This is not surprising considering that many of these models are built around the same underlying ideas of combining CNNs, LSTMs and VAEs. We found that the best performing models on MUtB were able to accurately synthesize videos at low resolutions, whereas weaker models sometimes failed to generate the moving unit at a higher resolution. On the other hand, the CMS scenario caused most of the models to fail in producing accurate results. Models were able to learn that shards would disappear from the map, yet they failed both at modeling the moving units themselves, and at capturing the correct sequence in which the mineral shards were meant to disappear. On the Brawl map, the models were unable to capture all of the game units participating in the brawl, and instead resorted to generating larger blurry blobs. Perhaps surprisingly, none of the models was able to succeed at modeling the RTwM scenario. Whether this is due to the longer video sequences that are required to be considered or inherent to the complexity of the scenario remains a question for further research. Please see Appendix C for examples of generated samples that are representative of various behaviors.

4.5 Correlation of FVD with SSIM and PSNR

As a final experiment we evaluate the degree to which FVD correlates with SSIM and PSNR. Using all the models from Section 4.4, we end up with more than 20 000 different models. These can be used to obtain an extensive range of potentially different generated videos that one could encounter in training generative models of video. Overall, we find that the correlation between SSIM and PSNR is generally very high (Pearson’s r = 0.730, Kendall’s τ = 0.648), which is unsurprising given that both metrics are based on the same frame-by-frame measurements. There was a weaker, yet still statistically significant correlation between SSIM and FVD (r = -0.640, τ = -0.189), which indicates that SSIM does to some extent pick up on the same types of defects that FVD also detects. The correlation between FVD and PSNR was rather weak (r = -0.278, τ = -0.007).
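Both correlation coefficients can be computed from two arrays holding one score per model. The sketch below is illustrative only: it does not state which variant of Kendall's coefficient the authors used, and it implements the simple tau-a form (no tie correction) with a quadratic-time pairwise comparison; the function names are hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(np.asarray(x, dtype=float), np.asarray(y, dtype=float))[0, 1])


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total number of pairs."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    sign_x = np.sign(x[:, None] - x[None, :])
    sign_y = np.sign(y[:, None] - y[None, :])
    # Each unordered pair appears twice in the matrix, hence n * (n - 1) in the denominator.
    return float((sign_x * sign_y).sum() / (n * (n - 1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ssim = rng.uniform(0.5, 1.0, size=200)              # synthetic scores for illustration
    fvd = 1000.0 * (1.0 - ssim) + rng.normal(scale=80.0, size=200)
    print(pearson_r(ssim, fvd), kendall_tau_a(ssim, fvd))
```

Negative values are expected when a higher-is-better metric such as SSIM is compared with a lower-is-better metric such as FVD.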

5 Conclusion

We introduced the Fréchet Video Distance (FVD), a new evaluation metric for generative models of video, and an important step towards better evaluation of models for video generation. By design, the currently favored metrics, SSIM and PSNR, can only account for the quality of individual frames, and are restricted to the case in which ground-truth sequences are available. In contrast, FVD can even be used in situations where this is not the case, such as unconditional video generation via Generative Adversarial Networks. Our experiments confirm that FVD is accurate in evaluating videos that were modified to include static noise and temporal noise. More importantly, a large-scale human study on videos generated by several recent generative models reveals that FVD consistently outperforms SSIM and PSNR in agreeing with human judgment.

Our second contribution is the StarCraft 2 Videos (SCV) benchmark, which introduces several new data sets (and corresponding scenarios) of different complexity that test long-term memory and relational reasoning. In our evaluation of state-of-the-art models for video generation we find that these challenges are indeed still open, and we invite the community at large to tackle them. It is our hope that by isolating (independent) challenges in a visually simpler setting, it becomes easier for researchers to propose, test, and analyze potential solutions to these specific problems. In general, we believe that FVD and SCV will greatly benefit research in generative models of video by providing a well-tailored, objective measure of progress.


References

  • [1] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. International Conference on Learning Representations (ICLR), 2017.
  • [2] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. International Conference on Learning Representations (ICLR), 2018.
  • [3] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv, 2018.
  • [4] W. Byeon, Q. Wang, R. K. Srivastava, P. Koumoutsakos, P. Vlachas, Z. Wan, T. Sapsis, F. Raue, S. Palacio, T. Breuel, et al. Contextvp: Fully context-aware video prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1122–1126, 2018.
  • [5] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [7] E. Denton and R. Fergus. Stochastic video generation with a learned prior. International Conference on Machine Learning (ICML), 2018.
  • [8] D. Dowson and B. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.
  • [9] F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. In Conference on Robot Learning, pages 344–356, 2017.
  • [10] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. Advances in Neural Information Processing Systems (NIPS), 2016.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in neural information processing systems (NIPS), 2014.
  • [12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • [13] E. Haller and M. Leordeanu. Unsupervised object segmentation in video by efficient selection of highly probable positive features. IEEE International Conference on Computer Vision (ICCV), 2017.
  • [14] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems (NIPS), 2017.
  • [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [16] Q. Huynh-Thu and M. Ghanbari. The accuracy of psnr in predicting video quality for different video scenes and frame rates. Telecommunication Systems, 2012.
  • [17] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [18] H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, and J. Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [19] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. International Conference on Machine Learning (ICML), 2017.
  • [20] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. International Conference on Learning Representations (ICLR), 2018.
  • [21] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. arXiv, 2017.
  • [22] H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre. Hmdb51: A large video database for human motion recognition. In High Performance Computing in Science and Engineering ‘12, pages 571–582. Springer, 2013.
  • [23] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
  • [24] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv, 2018.
  • [25] A. Lerer, S. Gross, and R. Fergus. Learning physical intuition of block towers by example. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pages 430–438. JMLR. org, 2016.
  • [26] B. Lotter, G. Kreiman, and D. Cox. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. International Conference on Learning Representations (ICLR), 2017.
  • [27] P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic segmentation using adversarial networks. arXiv, 2016.
  • [28] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are gans created equal? a large-scale study. Advances in Neural Information Processing Systems (NIPS), 2018.
  • [29] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR), 2016.
  • [30] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, et al. Image database tid2013: Peculiarities, results and perspectives. Signal Processing: Image Communication, 30:57–77, 2015.
  • [31] K. Preuer, P. Renz, T. Unterthiner, S. Hochreiter, and G. Klambauer. Fréchet chemnet distance: A metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling, 58(9):1736–1741, 2018.
  • [32] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [33] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv, 2014.
  • [34] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. International Conference on Computer Vision (ICCV), 2017.
  • [35] M. S. Sajjadi, B. Schölkopf, and M. Hirsch. Enhancenet: Single image super-resolution through automated texture synthesis. International Conference on Computer Vision (ICCV), 2017.
  • [36] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR), volume 3, pages 32–36. IEEE, 2004.
  • [37] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv, 2012.
  • [38] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using lstms. International Conference on Machine Learning (ICML), 2015.
  • [39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [40] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. International Conference on Learning Representations (ICLR), 2016.
  • [41] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [42] T. Unterthiner, B. Nessler, C. Seward, G. Klambauer, M. Heusel, H. Ramsauer, and S. Hochreiter. Coulomb GANs: Provably Optimal Nash Equilibria via Potential Fields. International Conference on Learning Representations (ICLR), 2018.
  • [43] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit. Tensor2tensor for neural machine translation. arXiv, 2018.
  • [44] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al. StarCraft II: A new challenge for reinforcement learning. arXiv, 2017.
  • [45] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2016.
  • [46] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.
  • [47] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Appendix A Noise Study

A mapping of the different noise intensities to the parameter values of the various noise types that we consider can be seen in Table 4.

Noise type      | Parameter                                        | Int. 1 | Int. 2 | Int. 3 | Int. 4 | Int. 5 | Int. 6
Black rectangle | size relative to image                           | 15%    | 30%    | 45%    | 60%    | 75%    | N/A
Gaussian blur   | sigma of Gaussian kernel                         | 1      | 2      | 3      | 4      | 5      | N/A
Gaussian noise  | percentage of noise in convex combination        | 15%    | 30%    | 45%    | 60%    | 75%    | N/A
Salt & Pepper   | probability of applying ‘salt’ or ‘pepper’       | 0.1    | 0.2    | 0.3    | 0.4    | 0.5    | N/A
Local swap      | number of swaps                                  | 4      | 8      | 12     | 16     | 20     | 24
Global swap     | number of swaps                                  | 4      | 8      | 12     | 16     | 20     | 24
Interleaving    | number of sequences                              | 2      | 3      | 4      | 5      | 6      | N/A
Switching       | number of frames until switch                    | 1      | 2      | 3      | 4      | 5      | N/A
Table 4: An overview of the different noise intensities used for different noise types.
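As an illustration of how an intensity level in Table 4 maps to a concrete corruption, the sketch below applies the two swap-based temporal noise types to a sequence of frames. Only the numbers of swaps are taken from the table; the function names and the exact way frames are chosen for swapping are our assumptions and may differ from the procedure used in the noise study.

```python
import random

# Number of swaps per intensity, from the "Local swap" and "Global swap" rows of Table 4.
SWAPS_PER_INTENSITY = {1: 4, 2: 8, 3: 12, 4: 16, 5: 20, 6: 24}


def local_swap(frames, intensity, rng):
    """Repeatedly swap a randomly chosen frame with its right neighbour (assumed definition)."""
    frames = list(frames)  # work on a copy so the input video stays intact
    for _ in range(SWAPS_PER_INTENSITY[intensity]):
        i = rng.randrange(len(frames) - 1)
        frames[i], frames[i + 1] = frames[i + 1], frames[i]
    return frames


def global_swap(frames, intensity, rng):
    """Repeatedly swap two frames chosen anywhere in the sequence (assumed definition)."""
    frames = list(frames)
    for _ in range(SWAPS_PER_INTENSITY[intensity]):
        i, j = rng.sample(range(len(frames)), 2)
        frames[i], frames[j] = frames[j], frames[i]
    return frames


rng = random.Random(0)
video = list(range(16))  # frame indices stand in for actual image frames
print(local_swap(video, intensity=1, rng=rng))
print(global_swap(video, intensity=6, rng=rng))
```

Both functions only permute frames, so every individual frame remains untouched. This is why purely frame-based metrics cannot register this kind of corruption unless the frames are compared against an aligned ground-truth sequence.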

Figure 6 provides an overview of the correlation of various implementations of FVD (and an FID-based baseline) with the sequence of noise intensities.

Figure 6: Correlation of the noise intensity and the metric measurements. It can be seen that the logits of the I3D model trained on the Kinetics 400 dataset correlate well with the noise intensities, across a variety of noise types.

Appendix B SCV Data Generation

Each data set in SCV consists of 10 000 training, 2 000 validation and 2 000 test set videos of game play of an agent playing a custom StarCraft 2 scenario. Scenarios were created using the StarCraft 2 Editor, and agents were implemented in the SC2LE framework. All agents are deterministic, and randomness is controlled by the environment, as implemented by the scenario.

Data is obtained in two phases. First, a set of replay files is created by having an agent “play” a scenario in SC2LE. These encode a sequence of deterministic states in the environment, allowing the same content to be rendered at different resolutions. A sequence of states (agent interacting with the environment) is terminated once a particular ‘termination condition’ has been fulfilled. Different scenarios have different termination conditions:

  • MUtB: scenario ends when a unit reaches the border

  • CMS: scenario lasts for 2 in-game minutes

  • Brawl: scenario ends when one of the armies is victorious or 2 in-game minutes pass

  • RTwM: scenario ends when all units are unloaded at the final beacon
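The termination conditions listed above can be summarized as a single predicate. The following is a minimal sketch rather than the SC2LE implementation: `EpisodeState` and its fields are hypothetical names introduced for illustration, and we assume that elapsed game time is available in in-game seconds.

```python
from dataclasses import dataclass

TIME_LIMIT_SECONDS = 120  # 2 in-game minutes


@dataclass
class EpisodeState:
    """Hypothetical summary of the environment state; all field names are illustrative."""
    game_seconds: float = 0.0
    unit_at_border: bool = False
    army_victorious: bool = False
    all_units_unloaded: bool = False


def is_terminated(scenario: str, state: EpisodeState) -> bool:
    """Return True once the termination condition of the given scenario is fulfilled."""
    if scenario == "MUtB":
        return state.unit_at_border
    if scenario == "CMS":
        return state.game_seconds >= TIME_LIMIT_SECONDS
    if scenario == "Brawl":
        return state.army_victorious or state.game_seconds >= TIME_LIMIT_SECONDS
    if scenario == "RTwM":
        return state.all_units_unloaded
    raise ValueError(f"unknown scenario: {scenario}")


print(is_terminated("Brawl", EpisodeState(game_seconds=45.0, army_victorious=True)))
print(is_terminated("CMS", EpisodeState(game_seconds=45.0)))
```

Only CMS has a fixed duration under these rules; the other scenarios end on an event, which is what leads to replays of varying length.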

Since these scenarios have widely different lengths, we rendered them at different speeds: in MUtB, we recorded every frame, in CMS and Brawl every frame, and in RTwM every frame. We additionally always skipped the first 2 frames of each video, as these first frames are rendered before the scenario is fully initialized. The stochasticity in each scenario results in replays (and corresponding videos) having different lengths. The generated data sets contain between 11 and 27 frames on MUtB, up to 99 frames on Brawl, and exactly 99 and 32 frames on CMS and RTwM, respectively. When training, we subsample shorter videos of the required length.
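The recording and subsampling steps described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used to build SCV: the function names are hypothetical, the stride is left as a parameter (the value in the example is arbitrary), and we assume that training clips are drawn as random contiguous windows.

```python
import random


def record_frames(rendered, stride, skip_initial=2):
    """Drop the first `skip_initial` frames, then keep every `stride`-th frame."""
    return rendered[skip_initial::stride]


def sample_clip(video, clip_length, rng):
    """Pick a random contiguous clip of `clip_length` frames (assumed sampling scheme)."""
    if len(video) < clip_length:
        raise ValueError("video is shorter than the requested clip length")
    start = rng.randrange(len(video) - clip_length + 1)
    return video[start:start + clip_length]


rendered = list(range(200))                # frame indices stand in for rendered images
video = record_frames(rendered, stride=4)  # stride=4 is an arbitrary illustrative value
clip = sample_clip(video, clip_length=16, rng=random.Random(0))
print(len(video), clip)
```

Sampling a window rather than truncating each video means that all parts of a longer replay can be seen during training.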

Appendix C Examples of Models trained on SCV

In the following examples, the top row shows a video sequence from the data set, and the rows below show the videos generated by models that were conditioned on the first two frames of that sequence. Compare the quality of the generated samples to the FVD scores in the right columns of Table 3.

Figure 7: Examples from Move Unit to Border in resolution. Top to Bottom: Original video, CDNA, SV2P, SVP-FP, SAVP
Figure 8: Examples from Collect Mineral Shards in resolution. Top to Bottom: Original video, CDNA, SV2P, SVP-FP, SAVP
Figure 9: Examples from Brawl. Top to Bottom in resolution: Original video, CDNA, SV2P, SVP-FP, SAVP
Figure 10: Examples from Road Trip with Medivac in resolution. Top to Bottom: Original video, CDNA, SV2P, SVP-FP, SAVP