1 Introduction
Video synthesis is a very challenging problem [IVGAN, VideoFlowGAN, TGAN, MoCoGAN, VideoGAN], arguably even more challenging than the already difficult image generation task [GAN, AutoEncodingVB, ImprovedGAN]. The temporal dimension of the data introduces an additional mode of variation, since feasible motions are dependent on the object category and the scene appearance. Consequently, the evaluation of video synthesis methods should account not only for the quality of individual frames but also for their temporal coherence, motion realism, and diversity.
In this work, we take a closer look at the temporal quality of unconditional video generators, represented by the state-of-the-art MoCoGAN approach [MoCoGAN]. Note that this subcategory of video generation is different from future frame prediction [SAVP, PredBeyond], which takes a number of initial frames as input. We only rely on the training data as input instead.¹

¹ Note that the model is still conditioned on the particular training data distribution, hence not truly “unconditional”. Still, we adhere to the common terminology used in the literature.
We find that the common training strategy of sampling a fixed-length video subsequence at training time often leads to degenerate solutions. As illustrated in Fig. 1, the MoCoGAN model exhibits temporal artifacts as soon as the video sequence length at inference time exceeds the length of the temporal window used at training time. We identify two common types of such artifacts. If the model continues to predict the last frame without change, we refer to this as freezing. Looping, on the other hand, occurs when an exact subsequence of frames is repeated continually.
To address these limitations, we make two main contributions. First, to tackle the detrimental effect of fixed-length video training, we reformulate video generation as a Markov Decision Process (MDP). This reformulation allows approximating an infinite forecast horizon in order to optimize every generated frame with respect to its long-term effect on future frames. One benefit of our MDP formulation is that it is model-agnostic. We evaluate it by applying it to the state-of-the-art MoCoGAN [MoCoGAN], which requires only a minor modification of the original design and does not significantly increase the model capacity. Second, we propose a family of evaluation metrics to detect and measure the temporal artifacts. Our new metrics are model-free, simple to implement, and easy to interpret. In contrast to the Inception Score (IS) [ImprovedGAN] or the recent Fréchet Video Distance (FVD) [FrechetVD], the proposed metrics do not require model pre-training and, hence, do not build upon a data-sensitive prior. Our experiments show that our MDP-based formulation leads to a consistent improvement of the video quality, both in terms of artifact mitigation and in terms of the more common metrics, the IS and FVD scores.

2 Related Work
Video generation models can be divided into two main categories: conditional and unconditional. Exemplified by the task of future frame prediction, conditional models historically preceded unconditional ones, and some of their features lend themselves to unconditional generation as well. Therefore, we first give a brief overview of conditional approaches.
Conditional video generation. One of the first network-based models of motion dynamics used a temporal extension of Restricted Boltzmann Machines (RBMs) [SutskeverH07, HumanMotionRBM], with a focus on resolving the intractable inference [RTRBM]. The increasing volume of video data for deep learning shifted the attention to learning suitable representations and enabling some control over the generated frames [HolisticControl]. Srivastava et al. [UnsupervisedVideoRepresentation] show that unsupervised sequence-to-sequence pre-training with LSTMs [HochreiterS97] enhances performance on the supervised frame prediction task. Patch-based quantization of the output space [VideoPredictionBaseline] or predicting pixel motion [PhysicalInteraction, LiuYTLA17] can improve the frame appearance at larger resolutions. In contrast, Kalchbrenner et al. [VideoPixelNetwork] predict pixel-wise intensities and extend the context model of PixelCNNs [OordKEKVG16] to the temporal domain. A coarse-to-fine strategy allows decoupling the structure from the appearance [VillegasYHLL17, LongTermFuture], or dedicating individual stages of a pipeline to multiple scales [PredBeyond].

Frames of the distant future cannot be extrapolated deterministically due to the stochastic nature of the problem [BabaeizadehFECL18, SAVP, Xue0BF16]: there are multiple feasible futures for a given initial frame. In practice, this manifests itself in frame blurring, a gradual loss of detail in the frame. To alleviate this effect, Mathieu et al. [PredBeyond] used an adversarial loss [GAN]. Liang et al. [LiangLDX17] further show that adversarial learning of the pixel flows leads to better generalization.
Unconditional video generation. These more recent methods are based on the GAN framework [GAN] and incorporate some of the insights from their conditional counterparts. For example, Vondrick et al. [VideoGAN] decouple the active foreground from a static background by using an architecture with two parallel generator streams. Saito et al. [TGAN] use two generators to disentangle the video representation into distinct temporal and spatial domains. Following [VillegasYHLL17], the state-of-the-art MoCoGAN of Tulyakov et al. [MoCoGAN] decomposes the latent representation into content and motion parts for finer control over the generated scene. In addition, the discriminator in the MoCoGAN model is separated into image and video modules. While the image module targets the visual quality of individual frames, the focus of the video discriminator is temporal coherence.
Evaluating unconditional video generators. Borrowed from the image generation literature [ImprovedGAN], the Inception Score (IS) has become one of the established metrics for quality assessment in videos [TGAN, MoCoGAN, VideoGAN]. The IS incorporates the entropy of the class distributions obtained from a separately trained classifier. Therefore, it is only meaningful if the training data distribution of the classifier matches the one on which it will be evaluated later. Following [HeuselRUNH17], Unterthiner et al. [FrechetVD] recently proposed the Fréchet Video Distance (FVD), which compares the distributions of feature embeddings of real and generated data.

However, these metrics provide only a holistic measure of the video quality and do not allow for a detailed assessment of its individual properties. One of the desirable qualitative traits of video generators is their ability to produce realistic videos of arbitrary length. Yet, the established experimental protocol evaluates only on video sequences of a fixed length. Indeed, some previous work [TGAN, VideoGAN] is even tailored to a predefined video length, both at training and at inference time.
3 MDP for Video Generation
To motivate the MDP view of video generation, we first review MoCoGAN [MoCoGAN] and discuss its limitations. After a short presentation of the MDP formalism (see [RLBook] for a comprehensive introduction), we then integrate the MDP into MoCoGAN to incorporate knowledge of the infinite-time horizon into the generative process.
[Figure: The MDP-based video discriminator. A sequence of 3D-convolutional layers with layer-specific dilations and strides; the input to each convolutional layer is the output of the previous one. The last layer produces the immediate rewards $r_t$, and the Q-values $Q_t$ are produced by the same network.]

3.1 Preliminaries
MoCoGAN.
Figure 1(a) illustrates the main components of MoCoGAN: the generator, the image discriminator, and the video discriminator. At every timestep, the stochastic generator emits one frame and maintains a recurrent state perturbed by random noise. The image discriminator provides feedback for a single image; the video discriminator evaluates a contiguous subsequence of frames of a predefined length $K$. The training objective is specified by the familiar max-min game
$\max_{G} \; \min_{D_I, D_V} \; \mathcal{L}_I(G, D_I) + \mathcal{L}_V(G, D_V)$  (1)
where $x$ and $v$ denote image and video samples from the training data, the generator $G$ provides their generated counterparts $\tilde{x}$ and $\tilde{v}$, and $\mathcal{L}_I$ and $\mathcal{L}_V$ are defined by the scalar scores of the image discriminator $D_I$ and the video discriminator $D_V$ [GAN, MoCoGAN].
We find that MoCoGAN’s samples exhibit looping and freezing patterns (see Sec. 5.4 for results and analyses). The intuitive reason comes from the specifics of training: to save memory, the training samples contain only subsequences of the complete video. As a result, the gradient signal from the video discriminator is unaware of the frames following the subsequence. The predefined length of the subsequence ultimately determines the maximum length of a sample with a non-repeating pattern.²

² To verify this, we also trained the MoCoGAN model on longer subsequences and found the breaking point to occur at a correspondingly later timestep.
MDP. In an MDP defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r)$, the agent interacts with the environment by performing actions $a_t \in \mathcal{A}$ based on the current state $s_t \in \mathcal{S}$. The environment specifies the outcome of the action by returning a reward $r_t$ and the next state $s_{t+1}$, governed by the transition function $P$. The goal of the agent is to find the optimal policy $\pi^*$, maximizing the discounted cumulative reward
$R_t = \mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}\big]$  (2)
where $\gamma \in [0, 1)$ is the discount factor that ensures the convergence of the sum.
In the context of an MDP, the generator plays the role of the agent’s policy $\pi$. The frames predicted by the generator are the actions $a_t$. The hidden recurrent state becomes the agent’s state $s_t$. The additive noise at every timestep determines the transition function $P$. A frame incurs a reward $r_t$ given by the score provided by the discriminators. Due to the deterministic mapping from state to frame, MoCoGAN’s generator corresponds to a deterministic policy [SilverLHDWR14] (the expectation in Eq. 2 becomes an equality). The optimization task for the agent is a search for the optimal policy $\pi^*$:
$\pi^* = \arg\max_{\pi} \big( r_t + \mathbb{E}\big[\sum_{k=1}^{\infty} \gamma^{k}\, r_{t+k}\big] \big)$  (3)
Observe that the MoCoGAN objective for the generator is equivalent to only the first term of Eq. 3, the immediate reward, since the video discriminator computes only a single score for a given video sample. In contrast, we also consider the future rewards, the second term of Eq. 3. To this end, we decompose the score of the video discriminator into immediate rewards associated with individual frames. We then learn a utility Q-function approximating the expected cumulative reward, $Q(s_t, a_t)$. Its definition is also known as Bellman’s optimality principle:
$Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$  (4)
By training the generator to maximize the Q-function instead of just the immediate reward, we arrive at an approximate solution of Eq. 3. In the next section, we detail how MoCoGAN can be extended to this setup.
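The generator-as-policy view of this section can be illustrated with a minimal rollout loop. Everything below (the linear maps, dimensions, and noise model) is an illustrative stand-in for MoCoGAN’s learned networks, not the authors’ implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FRAME_DIM = 4, 8

# Illustrative stand-ins for the learned networks (random linear maps).
W_rec = rng.normal(size=(STATE_DIM, STATE_DIM)) * 0.5   # recurrent transition
W_out = rng.normal(size=(FRAME_DIM, STATE_DIM))         # frame decoder

def transition(state, noise):
    """Transition P: the next state is a deterministic function of state + noise."""
    return np.tanh(W_rec @ state + noise)

def policy(state):
    """Deterministic policy pi: map the agent's state to a frame (the action)."""
    return W_out @ state

def rollout(num_frames):
    """Unroll the agent: each step emits one frame and updates the recurrent state."""
    state = rng.normal(size=STATE_DIM)
    frames = []
    for _ in range(num_frames):
        state = transition(state, rng.normal(size=STATE_DIM))
        frames.append(policy(state))
    return np.stack(frames)

video = rollout(16)
assert video.shape == (16, 8)   # 16 frames, each an 8-dim stand-in for an image
```

Because the rollout is recurrent, the same policy can be unrolled for any number of steps at inference time, which is precisely where the fixed-window training signal becomes problematic.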
3.2 Integrating MDP into MoCoGAN
We need the model implementing the MDP to comply with two requirements:

1. The Markov property needs to be fulfilled, i.e., the next state $s_{t+1}$ given the previous state $s_t$ is conditionally independent of the past history $s_1, \dots, s_{t-1}$.

2. By causality, the immediate reward $r_t$ is a function of the current state and action only and incorporates no knowledge about future actions.
The MoCoGAN generator already satisfies the Markov property using a parametrized RNN mapping from the current state to the next. However, the video discriminator has to be modified to satisfy the second requirement. This modification is straightforward to implement and leads to a variant of the Temporal Convolutional Network (TCN) [TCN].
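The causality requirement can be checked with a toy one-dimensional analogue: a convolution over time with left-only padding, in the spirit of TCNs. This is a plain NumPy sketch, not the actual discriminator:

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal convolution over the time axis: output[t] uses x[t-k+1 .. t] only.
    x: (T,) per-frame features; w: (k,) kernel."""
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])   # left-pad so no future leaks in
    return np.array([xp[t:t + k] @ w[::-1] for t in range(len(x))])

x = np.arange(8, dtype=float)
w = np.array([1.0, 1.0, 1.0])
r = causal_conv1d(x, w)          # per-timestep "immediate rewards"
assert r[2] == 3.0               # 0 + 1 + 2: only past and current inputs

# Causality check: perturbing a future frame leaves earlier outputs unchanged.
x2 = x.copy(); x2[5] += 100.0
r2 = causal_conv1d(x2, w)
assert np.allclose(r[:5], r2[:5]) and not np.allclose(r[5:], r2[5:])
```

The actual video discriminator applies the same idea with 3D convolutions over frames, but the property being tested, that the output at timestep t never depends on future inputs, is identical.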
Figure 1(b) gives an overview of the proposed MDP extension of the video discriminator. The key property of this design is that the output at timestep $t$ – a scalar – corresponds to a temporal receptive field covering the frames up to that timestep. In this way, the immediate reward captures only the relevant motion history. Fortunately, adapting the MoCoGAN video discriminator to this architecture is straightforward (see the supplemental material for more details).
To implement Eq. 4, alongside the immediate reward $r_t$ we also predict another time-dependent scalar, the Q-value $Q_t$. As discussed in Sec. 3.1, the purpose of the Q-value is to approximate the expected cumulative reward, $Q(s_t, a_t)$. We use a squared-difference loss, defined for each timestep $t$ by
$\mathcal{L}_Q(t) = \big( Q_t - \sum_{k=t}^{K} \gamma^{\,k-t}\, r_k \big)^{2}$  (5)
where $\gamma$ is the discounting factor specifying the lookahead span: larger values encourage the Q-value to account for the future outcome far ahead; low values focus the Q-value on the immediate effect of the current frames.
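Under this formulation, the regression target for each Q-value is the γ-discounted reward-to-go within the window. A small sketch of computing these targets by backward recursion (illustrative only, not the training code):

```python
import numpy as np

def q_targets(rewards, gamma):
    """Discounted reward-to-go within the training window:
    target[t] = sum_{k >= t} gamma**(k - t) * rewards[k]."""
    K = len(rewards)
    targets = np.zeros(K)
    acc = 0.0
    for t in reversed(range(K)):     # backward pass: target_t = r_t + gamma * target_{t+1}
        acc = rewards[t] + gamma * acc
        targets[t] = acc
    return targets

t = q_targets(np.ones(4), 0.5)
# The first target sums all 4 rewards; the last sees only its own reward.
assert np.allclose(t, [1.875, 1.75, 1.5, 1.0])
```

Note how the target for the last timestep degenerates to the immediate reward alone, which is exactly the imbalance the λ-weighting below is designed to compensate for.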
Our TCN-based video discriminator ensures that the parameters for predicting $Q_t$ are shared across all timesteps $t$. As a result, Eq. 5 forces even the last $Q_t$ to incorporate knowledge of rewards beyond the temporal window of size $K$. Hence, by maximizing $Q_t$, the generator will implicitly maximize the rewards also for $t > K$. Contrast this with the original video discriminator producing a single score for the complete frame sequence: due to the lack of causality, the generator is “unaware” that at inference time the requested video length may exceed $K$.
Note that the definition in Eq. 5 is confined to a limited time window of length $K$ to ensure that the memory consumption remains manageable. Now, our task is to train the generator by maximizing the Q-values incorporating the long-term effects of individual predictions. However, since we keep $K$ fixed, each consecutive $Q_t$ in Eq. 5 will be optimized towards a sum containing one term fewer. That is, $Q_1$ will approximate a sum of $K$ immediate rewards, $Q_2$ a sum of $K-1$ terms, and so on. As a result, $Q_1$ incorporates the effect of the first frame on all future frames in the window, whereas $Q_K$ will only observe the influence of the last frame on the last prediction. It is therefore evident that the Q-values are not equally informative for modeling the long-term dependencies as supervision to the generator.
To reflect this observation in our training, we introduce an additional discounting factor $\lambda$ that shifts the weight of the long-term supervision to the first frames, but offsets the reliance on the Q-value for the last predictions. Concretely, the new term in the generator loss is
$\mathcal{L}_{\mathrm{MDP}}(G) = -\sum_{t=1}^{K} \lambda^{\,t-1}\, Q_t$  (6)
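The effect of λ can be sketched as follows: earlier Q-values, which summarize more future rewards, receive exponentially larger weight in the generator supervision (symbols and scale are illustrative):

```python
import numpy as np

def mdp_generator_term(q_values, lam):
    """Weight Q_t by lam**(t-1) so that supervision relies more on the early,
    better-informed Q-values and less on the late, short-horizon ones."""
    weights = lam ** np.arange(len(q_values))   # [1, lam, lam**2, ...]
    return float(np.sum(weights * q_values))

q = np.ones(4)
assert mdp_generator_term(q, 1.0) == 4.0     # lam = 1: all timesteps weighted equally
assert mdp_generator_term(q, 0.5) == 1.875   # lam < 1: early timesteps dominate
```

The generator would maximize this weighted sum (i.e., minimize its negative, as in Eq. 6).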
To summarize, extending the original MoCoGAN training objective (Eq. 1) into our MDP-based GAN yields
$\mathcal{L}_D = \mathcal{L}_I(G, D_I) + \mathcal{L}_V(G, D_V) + \sum_{t=1}^{K} \mathcal{L}_Q(t)$  (7a)

$\mathcal{L}_G = \mathcal{L}_I(G, D_I) + \mathcal{L}_V(G, D_V) + \mathcal{L}_{\mathrm{MDP}}(G)$  (7b)
Here, we split the original objective of Eq. 1 into the discriminator- and generator-specific losses for illustrative purposes, although the joint nature of the max-min optimization problem remains. Following standard practice [GAN], we optimize the new objective by alternately updating the discriminators using Eq. 7a and the generator using Eq. 7b.
4 Quantifying Temporal Diversity
Motivated by our observation of the looping and freezing artifacts (see Fig. 1), we propose an interpretable way to quantify the temporal diversity of the video. Here, our assumption is that realistic videos comprise a predominantly unique sequence of frames. The idea then is to compare the predicted frame to the preceding ones: if there is a match, this indicates a reoccurring pattern in the sequence.
Let $x_1, \dots, x_T$ be a sequence of frames predicted by the model. Our diversity measure relies on a distance function $d(\cdot, \cdot)$ of choice between arbitrary frames and is defined as
$\mathrm{tD}(x_{1:T}) = \frac{1}{T-1} \sum_{t=2}^{T} \min_{1 \le k < t} d(x_t, x_k)$  (8)
where we use the prefix “t” for disambiguation. Eq. 8 essentially finds the most similar preceding frame and averages the distance over all such pairs in the sequence. The obvious dual of this metric is to replace the distance function in Eq. 8 with a similarity measure and substitute the max for the min operation. In this work, we use two instantiations of Eq. 8: tDSSIM employs the structural similarity (SSIM) [MetricSSIM] in the distance function $d$; tPSNR utilizes the peak signal-to-noise ratio (PSNR) as a similarity measure. Hence, higher tDSSIM and lower tPSNR indicate a higher diversity of frames within a sequence. We show next that despite its apparent simplicity, our proposed metric effectively captures deficiencies in frame diversity.
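A minimal NumPy sketch of Eq. 8 with PSNR as the similarity measure follows; for tDSSIM, one would plug an SSIM-based distance (e.g., from scikit-image) into the same scheme with min in place of max:

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio between two frames (higher = more similar)."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def t_metric(frames, pair_fn, reduce_fn):
    """Eq. (8): for each frame, find the closest preceding frame under pair_fn
    (reduce_fn = min for a distance, max for a similarity), then average."""
    vals = [reduce_fn(pair_fn(frames[t], frames[k]) for k in range(t))
            for t in range(1, len(frames))]
    return float(np.mean(vals))

def t_psnr(frames):
    return t_metric(frames, psnr, max)   # similarity -> max; lower tPSNR = more diverse

rng = np.random.default_rng(0)
diverse = rng.random((8, 4, 4))                  # unique random frames
frozen = np.repeat(diverse[:1], 8, axis=0)       # freezing: one frame repeated
assert t_psnr(frozen) == float("inf")            # exact repeats give infinite PSNR
assert t_psnr(diverse) < t_psnr(frozen)          # diverse frames score lower
```

Frame arrays and sizes here are toy placeholders; on real videos one would iterate over full-resolution frames in the same way.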
5 Experiments
5.1 Datasets
Following the established evaluation protocol from previous studies [TGAN, MoCoGAN], we use the following benchmarks:

- Human Actions [HumanActions]: The dataset contains 81 videos of 9 people performing 9 actions (walking, jumping, etc.). All videos are extracted at a fixed frame rate and downscaled. We also add a flipped copy of each video sequence to the training set. Following Tulyakov et al. [MoCoGAN], we use only 4 action classes, which amounts to 72 videos for training in total.

- UCF101 [UCF101]: This dataset consists of 13,320 videos with 101 classes of human actions grouped into 5 categories: human-object interaction, human-human interaction, body motion, playing musical instruments, and sports. The dataset is challenging due to its high diversity of scenes, motion dynamics, and viewpoint changes.

- Tai Chi: The dataset contains 72 Tai Chi videos taken from the UCF101 dataset.³ All videos are centered on the performer and downscaled. We use this dataset for our ablation studies, as it has moderate complexity yet represents real-world motion.

³ Note that the Tai Chi subset used in the evaluation of MoCoGAN [MoCoGAN] is not publicly available and could not be obtained due to licensing restrictions.
5.2 Overview
We first verify that tPSNR and tDSSIM effectively quantify the temporal artifacts. We then employ these metrics to analyze the extent to which the MoCoGAN model [MoCoGAN] exhibits these artifacts. Next, we study the effect of the time-horizon hyperparameters, $\gamma$ and $\lambda$, of our MDP approach. Finally, we validate our approach on the Human Actions dataset and on the more challenging UCF101 dataset. We compare our model to TGAN [TGAN] and MoCoGAN, where we find a consistent improvement of the temporal diversity over the baseline.

We compute the IS following Saito et al. [TGAN], who trained the C3D network [C3D] on the Sports-1M dataset [Sport1M] and then fine-tuned it on UCF101 [UCF101]. For FVD, we use the original implementation by Unterthiner et al. [FrechetVD]. To manage computational time, we calculate the FVD for the first 16 frames, sampled from 256 videos, and derive the FVD mean and variance from 4 trials, similar to IS.
Table 1: Ablation study on the Tai Chi dataset (the four rightmost columns correspond to different settings of the time-horizon hyperparameters $\gamma$ and $\lambda$).

| Metric | Real data | MoCoGAN | TCN ($\gamma=0$, $\lambda=0$) | MDP (setting 1) | MDP (setting 2) | MDP (setting 3) | MDP (setting 4) |
|--------|-----------|---------|-------------------------------|-----------------|-----------------|-----------------|-----------------|
| IS | 1.63 ± 0.05 | 4.49 ± 0.04 | 4.52 ± 0.05 | 4.15 ± 0.04 | 4.24 ± 0.06 | 3.92 ± 0.07 | 3.99 ± 0.04 |
| FVD | 118 ± 5 | 828 ± 38 | 1108 ± 50 | 787 ± 10 | 782 ± 40 | 744 ± 40 | 809 ± 22 |
| tDSSIM | 0.0135 | 0.0031 | 0.0031 | 0.0024 | 0.0037 | 0.0035 | 0.0035 |
| tPSNR | 36.48 | 45.37 | 57.34 | 50.16 | 44.87 | 44.39 | 45.06 |
5.3 Metric evaluation
We design a set of proof-of-concept experiments to study the properties of the newly introduced tPSNR and tDSSIM. Concretely, we synthesize the looping and freezing patterns in the ground-truth videos from UCF101 and Tai Chi. We construct 16-frame sequences by sampling 8 frames directly from the dataset and completing the sequence with an artifact counterpart. Looping-FWD contains a repeating subsequence from the original video (Original), whereas Looping-BWD reverses the frame order. The size of the reoccurring subsequence in Freezing is one. To put the results in context, we also compare to the mainstream IS as well as the recent FVD scores and study the robustness of all metrics to additive Gaussian noise. The results are summarized in Table 4.
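The synthetic probes can be constructed as follows (a sketch; frames are assumed to be stacked along the first array axis):

```python
import numpy as np

def make_probe(real16, kind):
    """Build a 16-frame probe: 8 real frames followed by 8 artifact frames."""
    head = real16[:8]
    if kind == "looping_fwd":        # repeat the same 8 frames forward
        tail = head
    elif kind == "looping_bwd":      # replay the 8 frames in reverse order
        tail = head[::-1]
    elif kind == "freezing":         # keep repeating the last frame
        tail = np.repeat(head[-1:], 8, axis=0)
    else:                            # "original": keep the real continuation
        tail = real16[8:]
    return np.concatenate([head, tail], axis=0)

video = np.arange(16)[:, None] * np.ones((1, 4))   # toy 16-frame "video"
frozen = make_probe(video, "freezing")
assert frozen.shape == (16, 4)
assert np.allclose(frozen[8:], frozen[7])          # the tail repeats frame 8
```

Running the temporal metrics on such probes against the unmodified originals is what Table 4 quantifies.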
We observe that tPSNR and tDSSIM correlate well with the more sophisticated IS and FVD. Recall that both IS and FVD require training a network on videos of fixed length; hence, they (i) can be computed only for short videos, due to GPU memory constraints, and (ii) may be misleading (see the Tai Chi results in Table 4) when the training data of the inception network differs from the evaluated data. By contrast, tPSNR and tDSSIM prove faithful in quantifying the artifacts we study, as they are data-agnostic and accommodate videos of arbitrary length. However, our metrics are permutation-invariant, do not assess the quality of the frames themselves, and are not robust to random noise. Hence, we stress their complementary role to IS and FVD as a measure of the overall video quality.
5.4 MoCoGAN: a case study
Here, we study the temporal diversity of the MoCoGAN model [MoCoGAN] using our tPSNR and tDSSIM scores.
We train MoCoGAN⁴ on the UCF101 dataset with a temporal window of size $K$, and apply our temporal metrics to the samples from the generator. To enable a more detailed view of the temporal dynamics, we inspect the video samples as a function of time in Fig. 4 by plotting the values of the summands in Eq. 8 for each timestep. To rule out the possibility of any degenerate phenomena in the original data, we also plot the corresponding curves of the ground-truth sequences alongside. This clearly shows that MoCoGAN exhibits a vanishing diversity of video frames – a pattern that is not found in the training data.

⁴ We use the publicly available code provided by the MoCoGAN authors at https://github.com/sergeytulyakov/mocogan.
5.5 MDP approach: an ablation study
Here, we perform an ablation study of our MDP approach by varying the time-horizon hyperparameters, $\gamma$ and $\lambda$, introduced in Sec. 3.2. Recall that $\gamma$ controls the timespan of the future predictions modeled by the Q-value: lower values imply a shorter time horizon, whereas higher values encourage the model to learn long-term dependencies. The parameter $\lambda$, on the other hand, specifies how accounting for the long-term effect is distributed over the timesteps. High values specify an equal distribution; lower values force the model to encode the long-term effects more in the earlier than in the later timesteps. As a boundary case, we also consider $\gamma = 0$ and $\lambda = 0$ to gauge the effect of the architecture change in the video discriminator (TCN), which is needed to implement reward causality (Sec. 3.2). As quantitative measures, we use the Inception Score (IS) [ImprovedGAN], the Fréchet Video Distance (FVD) [FrechetVD], as well as our temporal metrics, tDSSIM and tPSNR, introduced in Sec. 4.
The results in Table 1 show that with increasing values of the time-horizon hyperparameters, our model clearly improves the temporal diversity in terms of tPSNR and tDSSIM. Moreover, we also observe that the TCN baseline ($\gamma = 0$, $\lambda = 0$) performs worse than the original MoCoGAN in terms of temporal diversity. This is easily understood when considering that the TCN alone does not have any lookahead into the future (Fig. 1(b)). However, once we enable taking the future rewards into account by virtue of our MDP formulation, we not only reach but actually surpass the temporal diversity of the baseline MoCoGAN, as expected.
The somewhat inferior IS and FVD scores might be due to their sensitivity to the data prior, as discussed in Sec. 5.3. This hypothesis is also supported by a qualitative comparison between MoCoGAN and our MDP model. Figure 5 gives one such example; more results can be found in the supplemental material. While we observe no notable difference in per-frame quality, the motion between consecutive frames from our MDP model is more apparent than in the samples from MoCoGAN (e.g., the torso of the performer).
Table 2: Quantitative comparison on Human Actions and UCF101.

| Model | K | Human Actions | | | | UCF101 | | | |
| | | IS | FVD | tDSSIM | tPSNR | IS | FVD | tDSSIM | tPSNR |
| Raw dataset | – | 3.39 ± 0.08 | 49 ± 2 | 0.0815 | 23.35 | 40.80 ± 0.26 | 452 ± 49 | 0.0723 | 28.34 |
| TGAN (Normal) | 16 | 2.90 ± 0.04 | 977 ± 31 | – | – | 8.11 ± 0.07 | 1686 ± 24 | – | – |
| TGAN (SVC) | 16 | 3.65 ± 0.10 | 227 ± 10 | – | – | 11.91 ± 0.21 | 1324 ± 23 | – | – |
| MoCoGAN | 16 | 3.53 ± 0.02 | 300 ± 8 | 0.0259 | 33.76 | 11.15 ± 0.10 | 1351 ± 49 | 0.0337 | 33.29 |
| MoCoGAN | 16 | 3.51 ± 0.02 | 245 ± 6 | 0.0243 | 34.79 | 11.48 ± 0.15 | 1314 ± 45 | 0.0358 | 33.61 |
| MoCoGAN | 24 | 3.47 ± 0.02 | 318 ± 9 | 0.0254 | 35.72 | 10.49 ± 0.09 | 1352 ± 49 | 0.0387 | 32.63 |
| MDP0 (ours) | 16 | 3.55 ± 0.03 | 1413 ± 15 | 0.0559 | 33.31 | 6.16 ± 0.08 | 2147 ± 87 | 0.0160 | 47.36 |
| MDP (ours) | 16 | 3.55 ± 0.02 | 641 ± 8 | 0.0604 | 30.12 | 11.86 ± 0.11 | 1277 ± 56 | 0.0370 | 32.77 |
| MDP (ours) | 24 | 3.49 ± 0.03 | 686 ± 12 | 0.0661 | 29.39 | 12.14 ± 0.18 | 1293 ± 58 | 0.0454 | 31.05 |
5.6 Human Actions and UCF101
We perform further experiments on the Human Actions and the more challenging UCF101 datasets.⁵ We select the values of $\gamma$ and $\lambda$ for our MDP model that provide a good trade-off between the improved tPSNR, tDSSIM, and FVD and only a slight drop of IS on Tai Chi (Sec. 5.2). For reference, we train the TCN baseline, MDP0, by setting $\gamma = 0$ and $\lambda = 0$ to decouple the influence of modeling the long-term effects from the changes in the MoCoGAN architecture required to comply with reward causality. We also train our MDP model and MoCoGAN with an extended temporal window of $K = 24$. Recall that a higher $K$ requires more GPU memory, but gives the model an advantage, since it observes longer sequences at training time. Therefore, we aim to mitigate the artifacts while keeping $K$ constant.

⁵ To ensure a fair comparison, we use the same inception network for IS and FVD and train the other methods [TGAN, MoCoGAN] using the authors’ implementations (see the supplemental material for details).
The quantitative results are summarized in Table 2. For both the Human Actions and UCF101 datasets, we observe a consistent improvement of our MDP model in terms of temporal diversity, measured by tPSNR and tDSSIM. Moreover, our model also outperforms MoCoGAN in terms of IS on both datasets, as well as FVD on the UCF101 dataset. This can be explained by the more varied nature of motion in these datasets compared to the Tai Chi dataset, which makes taking future frames into account more important. On the Human Actions dataset, the FVD score of our model is inferior to MoCoGAN. Recall from Sec. 5.2 that for the IS and FVD metrics we did not fine-tune the inception classifiers on the Human Actions dataset, which impedes the interpretability of the scores on this dataset. A visual inspection of the per-frame quality (see Fig. 6 for examples) reveals no perceptual loss compared to the baseline model. In contrast, disabling MDP modeling (MDP0) leads to a clear deterioration in video quality.



On both datasets, our model with $K = 16$ is also superior to MoCoGAN with $K = 24$ in terms of IS and FVD, and reaches on-par performance in terms of tPSNR and tDSSIM. Yet, our MDP-based formulation is significantly more memory efficient, since extending the temporal window at training time incurs additional memory costs. Concretely, at training time the MDP model with $K = 16$ consumes only moderately more memory than MoCoGAN, whereas setting $K = 24$ for the original MoCoGAN incurs a considerably higher memory footprint. Note that simply increasing the number of parameters of the video discriminator in MoCoGAN is less effective than our proposed MDP approach (see the corresponding MoCoGAN entry in Table 2). Also, our MDP model with $K = 24$ improves further over $K = 16$ on UCF101 and, regarding the temporal metrics, on Human Actions. A visual inspection of the samples from Human Actions did not reveal any perceptible difference to MoCoGAN or our MDP with $K = 16$, despite the inferior IS and FVD scores; we believe this to be an artifact of the evaluation specifics. The IS score of our MDP model is slightly inferior only to TGAN [TGAN]. However, TGAN can produce video sequences of only fixed length, whereas our MDP model can generate videos of arbitrary length, owing to the recurrent generator.
The qualitative results in Fig. 7 show that our model can generate complex scenes from UCF101 that are visually comparable to the MoCoGAN samples. Similar to our observation on Human Actions, MDP0 produces poorer samples, which asserts the efficacy of the underlying MDP. Since the interpretation of the UCF101 results is difficult, we examine a visualization of the pairwise distances between the frames of a video, shown in Fig. 8. The distance matrix can be represented as a lower-triangular two-dimensional heatmap, owing to the symmetry of the distance function. We observe that while MoCoGAN exhibits a looping pattern, our MDP approach tends to preserve the temporal qualities of the ground-truth datasets. Note that some samples in Human Actions can be naturally periodic (e.g., hand waving); hence, we do not expect our model to dispense with the looping pattern completely. The overall results suggest that modeling long-term dependencies with an MDP consistently leads to more diverse motion dynamics, which becomes more apparent in increasingly complex scenes.
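The heatmap of Fig. 8 corresponds to the lower triangle of the pairwise frame-distance matrix. The following sketch (with MSE standing in for the distance function) shows how a looping video reveals itself as an off-diagonal band of near-zero distances:

```python
import numpy as np

def pairwise_dist(frames):
    """Symmetric matrix D[i, j] = mean squared error between frames i and j."""
    T = len(frames)
    D = np.zeros((T, T))
    for i in range(T):
        for j in range(i):
            D[i, j] = D[j, i] = np.mean((frames[i] - frames[j]) ** 2)
    return D

# A period-4 looping sequence: frame t equals frame t-4.
loop = np.stack([np.full((2, 2), t % 4, dtype=float) for t in range(8)])
D = pairwise_dist(loop)
band = np.array([D[t, t - 4] for t in range(4, 8)])
assert np.allclose(band, 0.0)   # zero-distance band at lag 4 reveals the loop
assert D[1, 0] > 0              # consecutive distinct frames stay apart
```

Plotting the lower triangle of D as a heatmap yields the kind of visualization shown in Fig. 8; a diverse video has no low-distance bands away from the diagonal.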



6 Conclusions and Future Work
We revealed two pathological cases in the videos synthesized by the state-of-the-art MoCoGAN model, namely freezing and looping. To quantify the temporal diversity, we proposed an interpretable class of metrics. We showed that the SSIM- and PSNR-based metrics, tDSSIM and tPSNR, effectively complement IS and FVD in quantifying temporal artifacts. Next, we traced the artifacts to the limited training length, which inhibits long-term modeling of the video sequences. As a remedy, we reformulated video generation as an MDP and incorporated it into MoCoGAN. We showed the efficacy of our MDP model on the challenging UCF101 dataset, both in terms of our temporal metrics and in terms of the IS and FVD scores. Maintaining the recurrent state between training iterations or imposing a tractable prior on the state are promising extensions of this work toward generating long video sequences.
Acknowledgements. The authors thank Sergey Tulyakov and Masaki Saito for helpful clarifications.
References
Appendix A Overview
We elaborate on the evaluation protocol used in our study and provide additional qualitative examples, both from our approach and from MoCoGAN [MoCoGAN]. To enable reproducibility of our approach, we detail the architecture of our MDP-based video discriminator and the training specifics of our MDP approach.
Appendix B A Note on Reproducibility
In the main text, we indicated a discrepancy between the Inception Score (IS) we attained on UCF101 [UCF101] and the IS reported in the original works [TGAN, MoCoGAN]. Recall that we compute the IS following Saito et al. [TGAN], who trained the C3D network [C3D] on the Sports-1M dataset [Sport1M] and then fine-tuned it on UCF101 [UCF101]. We calculate the IS by sampling the first 16 frames from 10K videos and determining the mean and variance over 4 trials. As in previous work [TGAN, MoCoGAN], we use the first training split of the UCF101 dataset.⁶ We use the original authors’ implementations for both MoCoGAN⁷ and TGAN⁸ and train the respective models for 100K iterations. We did not observe further improvements of the IS for longer training schedules.

⁶ More details on the splits of the UCF101 dataset are available at http://crcv.ucf.edu/data/UCF101.php.
⁷ MoCoGAN repository provided at https://github.com/sergeytulyakov/mocogan.
⁸ TGAN repository provided at https://github.com/pfnet-research/tgan.
To facilitate transparency and reproducibility of our experiments, we highlight two contributing factors that we carefully considered in our evaluation: IS implementation and model selection.
[Algorithm A — Data: generated video; Result: Inception Score; Steps: bicubic interpolation, normalization, forward pass.]
IS implementation. We found the IS to be sensitive to subtle differences in implementation. Recall that we use the original TGAN [TGAN] evaluation code in our experiments. Algorithm A in Fig. 8(a) shows the main steps of this evaluation for a single video sample. While C3D [C3D] is trained for an input resolution of 112×112 pixels, video generators produce sequences at a lower resolution. Additionally, C3D requires the input image sequence to be normalized with the mean and standard deviation used for the network training. The TGAN evaluation (Fig. 8(a)) upsamples the videos, normalizes them, and feeds a center crop of 112×112 pixels into the C3D network. However, an alternative evaluation, shown as Algorithm B in Fig. 8(b), is to upsample the video directly to 112×112 pixels, apply the normalization at that scale, and feed the result into C3D without cropping. This subtle change in the evaluation leads to a tangible difference in the Inception Score. We summarize the results in Table 11 and compare these two versions of the evaluation to the scores reported in the original works [TGAN, MoCoGAN].
Table 11: Inception Scores reported in the original works vs. reproduced with the two evaluation algorithms.

| Method | Reported | Reproduced (Algorithm A) | Reproduced (Algorithm B) |
|--------|----------|--------------------------|--------------------------|
| MoCoGAN [MoCoGAN] | 12.42 ± 0.03 | 11.15 ± 0.10 | 12.03 ± 0.07 |
| TGAN-Normal [TGAN] | 9.18 ± 0.11 | 8.11 ± 0.07 | 9.90 ± 0.06 |
| TGAN-SVC [TGAN] | 11.85 ± 0.07 | 11.91 ± 0.21 | 14.04 ± 0.08 |
| MDP (ours) | 11.86 ± 0.11 | 11.86 ± 0.11 | 13.00 ± 0.07 |
Leaving out the cropping for the IS computation with Algorithm B leads to higher IS values in comparison to Algorithm A. Note that the IS for MoCoGAN produced by Algorithm B is closer to the reported values: it scores 12.03, which approaches the reported score of 12.42. However, the opposite is the case for TGAN-SVC, as expected, since Algorithm A is the unaltered version of the evaluation provided by the TGAN authors [TGAN] (we attribute the discrepancy for TGAN-Normal to model selection, which we discuss shortly). Importantly, regardless of the evaluation protocol, our MDP model always outperforms the MoCoGAN baseline, achieving 13.00 with Algorithm B.
Although the choice of Algorithm B over Algorithm A does not change the ranking of the methods, we emphasize that mixing results produced by the two algorithms does, and can essentially invalidate the experimental conclusions. Therefore, we use Algorithm A for all methods in the experiments presented in the main text.
We conclude that the specifics of the IS implementation lead to tangibly different results, and we stress the importance of using the same evaluation strategy for all methods in an experiment. Moreover, we additionally report the Fréchet Video Distance (FVD) as well as our two new metrics, PSNR and DSSIM, in the main paper to provide a complementary view to the IS.
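The quantity all of these protocols ultimately compute is the standard Inception Score, IS = exp(E_x[KL(p(y|x) || p(y))]). The following sketch shows the score itself, given per-sample class probabilities such as the softmax outputs of the pretrained C3D classifier; implementations additionally differ in details such as sample counts and split averaging, which are omitted here.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from an (N, K) array of class probabilities.

    IS = exp( mean over samples of KL( p(y|x) || p(y) ) ),
    where p(y) is the marginal class distribution over all samples.
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

A sanity check: uniform predictions give an IS of 1, while confident and diverse predictions over K classes approach the maximum of K.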
Model selection. Recall that the IS for TGAN-Normal using Algorithm A, 8.11 ± 0.07, is still inferior to the reported score of 9.18 ± 0.11 (Table 11). To investigate this discrepancy, we observe that the standard adversarial training objective [GAN] used to train video generation models [TGAN, MoCoGAN] serves only as a proxy criterion for the Inception Score. As a result, a lower training loss, or even convergence of training, does not necessarily imply an improvement in the Inception Score. Figure 11 illustrates this observation: the Inception Score fluctuates over the course of training, even after a considerable number of training iterations. Indeed, the IS achieved by TGAN-Normal at around 30K iterations is the highest over the course of training and comes close to (in fact exceeds) the score reported by the TGAN authors [TGAN]. Although it is disputable which strategy of selecting the final model for evaluation is more meaningful, we argue against searching for the best IS across training iterations. One reason is that there is no conventional train-test split for video generation, hence computing the IS for intermediate models amounts to "peeking" into the test-time performance of the model. Selecting the best IS also inhibits reproducibility: since IS fluctuations are rather random, no fixed training schedule can be defined a priori to reproduce the result. Additionally, computing the IS is computationally expensive, as it requires thousands of forward passes with a pretrained classification network (C3D).
We believe that more transparency, both at the stage of model selection and in the IS implementation, can improve the reproducibility of the Inception Score. We adhere to this principle and ensure a predefined training schedule and exactly the same evaluation methodology for all methods to enable fair and reproducible experiments.
Table 3: Video discriminator architecture: the original MoCoGAN variant (top) and our TCN variant (bottom). All layers are Conv3D; BatchNorm [BatchNorm] and LeakyReLU [maas2013rectifier] are applied where marked.

Model    | Filters | Kernel | Stride | Padding | Dilation | BatchNorm | LeakyReLU
Original | 64      | 4,4,4  | 1,2,2  | 0,1,1   | 1,1,1    | ✓         | ✓
         | 128     | 4,4,4  | 1,2,2  | 0,1,1   | 1,1,1    | ✓         | ✓
         | 256     | 4,4,4  | 1,2,2  | 0,1,1   | 1,1,1    | ✓         | ✓
         | 1       | 4,4,4  | 1,2,2  | 0,1,1   | 1,1,1    |           |
TCN      | 64      | 3,4,4  | 1,2,2  | 2,1,1   | 1,1,1    | ✓         | ✓
         | 128     | 3,4,4  | 1,2,2  | 4,1,1   | 2,1,1    | ✓         | ✓
         | 256     | 3,4,4  | 1,2,2  | 8,1,1   | 4,1,1    | ✓         | ✓
         | 1       | 1,4,4  | 1,2,2  | 0,1,1   | 1,1,1    |           |
Appendix C Additional Qualitative Examples
We make additional qualitative results available at https://sites.google.com/view/mdpforvideogeneration. The examples provide a visual comparison of MoCoGAN and our MDP-based model on the Tai Chi, Human Actions, and UCF101 datasets.
Appendix D Implementation Details
D.1 Architecture
Recall that our MDP approach is based on an extension of the video discriminator from the original MoCoGAN model to a TCN-like model implementing reward causality. As Table 3 details, we only modify the temporal domain and replace the standard convolutions with their dilated variants. The hyperparameters of the TCN are the number of dilated layers (blocks) and the convolution kernel size. Following Bai et al. [TCN], we set both hyperparameters such that the receptive field of the last TCN output covers the entire input sequence of frames. Note that the TCN version does not increase the number of parameters of the original video discriminator, but even reduces it due to a smaller kernel size in the temporal domain. It is only the addition of the Q-value approximation that slightly increases the model capacity. The architectures of the image discriminator and the generator remain in their original form [MoCoGAN].
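The temporal receptive field of such a stack is easy to compute. The sketch below assumes a temporal stride of 1, consistent with the stride column of the architecture table; it takes the (temporal kernel, temporal dilation) pairs of the TCN block from that table.

```python
def temporal_receptive_field(layers):
    """Receptive field (in frames) of stacked 1-D temporal convolutions
    with stride 1, given (kernel, dilation) pairs per layer."""
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
    return rf

# Temporal kernels and dilations of the TCN block (from the table):
tcn_layers = [(3, 1), (3, 2), (3, 4), (1, 1)]

# For comparison, four standard kernel-4 layers without dilation:
standard_layers = [(4, 1)] * 4
```

With these values, the dilated stack reaches a receptive field of 15 frames using smaller temporal kernels, versus 13 for the standard stack, which is consistent with the parameter reduction noted above.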
D.2 Training Details
We follow the training protocol of MoCoGAN [MoCoGAN] and use the ADAM optimizer [ADAM] with the same learning rate and moment hyperparameters for all networks. As regularization, we only use weight decay. We leave the settings for the motion and the content subspaces of MoCoGAN at their default parameters. We also keep the additive noise for images fed to the discriminators. In order to be compatible with the MoCoGAN evaluation (Appendix B), all networks are trained for the same number of iterations with the same minibatch size, while the random seed is kept constant to ensure equivalent parameter initialization across the experiments. For our ablation study on the Tai Chi dataset, we used a different batch size and number of training iterations. The training length for the models is set to a fixed number of frames, unless explicitly stated otherwise. We sample the original data with a fixed sampling stride of 2, as in MoCoGAN training.^{9} ^{9}The documentation for training MoCoGAN is provided at https://github.com/sergeytulyakov/mocogan/wiki/TrainingMoCoGAN. This means that in order to acquire the training frames, we extract a frame sequence with a random starting point from the ground-truth dataset and take every second frame. This procedure increases the amount of motion by reducing the fps of the original video sequence, since on the UCF101 dataset the originally sampled consecutive frames exhibit only imperceptible changes in scene appearance.
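The clip extraction described above can be sketched as follows; `sample_training_clip` and `clip_len` are hypothetical names introduced for illustration, with the stride defaulting to 2 ("take every second frame").

```python
import random

def sample_training_clip(video, clip_len, stride=2):
    """Extract a training clip from a video (a sequence of frames):
    pick a random starting point, then keep every `stride`-th frame
    until `clip_len` frames are collected."""
    span = stride * (clip_len - 1) + 1            # frames spanned in the original video
    start = random.randint(0, len(video) - span)  # random starting point
    return video[start:start + span:stride]
```

Subsampling in time this way doubles the inter-frame motion at a fixed clip length, which is the stated purpose of the stride.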