1 Introduction
Generative adversarial networks (GANs) [1]
were recently introduced as an unsupervised approach to generative modeling, employing game-theoretic training schemes in order to learn a probability density implicitly defined by training data. Under this setting, two models are trained jointly. The generator tries to map low-dimensional samples from some simple prior distribution to higher-dimensional structured data, while the discriminator, on the other hand, tries to determine whether samples are genuine or generator outputs. To date, state-of-the-art results have been obtained for GAN-based generative modeling of images
[2, 3] and audio, provided that image-like spectrogram representations are used [4, 5]. However, applications in other domains, such as temporal or discrete data, remain open problems under active investigation. Here, we direct our focus to adversarially learned video modeling.

A common strategy in recent attempts at training GANs for natural scene generation is to split the task into simpler parts. In [6], for instance, there are independent modules for foreground and background modeling. In both [7] and [8], motion and frame content are learned by different parts of the architecture, each designed specifically for one of those aspects. In turn, in [9] the authors tackle the problem by conditioning generation on optical flows provided a priori. However, in all such cases, even though the model architectures are designed to focus on different aspects of video generation, training is performed jointly, which can introduce training difficulties such as mode collapse and divergence [10]. This is due to the high dimensionality of videos, which also include a temporal component.
In this work, we further exploit the idea of splitting the video generation process into smaller and simpler components. Frame content and motion modeling are achieved by independent blocks within the complete model: a convolutional block responsible for mapping a low-dimensional vector into a frame, and a recurrent block that receives a fixed-dimension input vector and outputs a sequence of vectors used as inputs to the convolutional block, thus yielding a sequence of frames. Moreover, a new training scheme is devised on top of the proposed setting to avoid common issues faced when training GANs. Each of the above-mentioned components, i.e. the generative model of frames as well as the generative model of frame sequences, is trained separately. More specifically, the multi-discriminator setting introduced in
[11] is used in both steps to further stabilize training and produce diverse generators.

Under the described setting, the frame generator can be seen as a parametric representation of the manifold of video frames, i.e. a mapping from a much lower-dimensional space to actual frames. This model is trained first and, once we obtain good samples in terms of quality and diversity, the sequence component is trained. The sequence generator, in turn, is a recurrent model trained with the goal of learning how to effectively traverse the manifold induced by the pre-trained frame generator in a way that yields coherent frame sequences.
2 Generative Adversarial Networks
GANs are generally composed of a discriminator model $D: \mathbb{R}^n \to [0, 1]$, where $n$ is the dimensionality of the input space, and a generator $G: \mathbb{R}^m \to \mathbb{R}^n$, where $m$ is the size of an input noise vector $z$. $D$ receives either a sample $x$ drawn from the data distribution $p_{data}$ or a sample produced by the generator, $G(z)$. During training, its goal is to learn how to tell apart these two types of inputs. The generator, on the other hand, aims at fooling the discriminator by learning how to produce samples as close to the data distribution as possible. GAN training was originally defined as a min-max game, but here we utilize the non-saturating game, as defined in [10]. According to this training scheme, the discriminator loss $\mathcal{L}_D$ and the generator loss $\mathcal{L}_G$ are respectively defined as
$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_{data}}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (1)$$

$$\mathcal{L}_G = -\mathbb{E}_{z \sim p_z}[\log D(G(z))] \quad (2)$$
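As a concrete illustration, the two losses above can be written in a few lines of NumPy. The names `d_real` and `d_fake`, standing for the discriminator's outputs on data samples and on generator samples, and the small `eps` stabilizer are our own assumptions, not part of the original implementation:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    # Eq. (1): -E[log D(x)] - E[log(1 - D(G(z)))]
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, eps=1e-8):
    # Eq. (2), non-saturating game: -E[log D(G(z))]
    return -np.mean(np.log(d_fake + eps))
```

When the discriminator outputs 0.5 everywhere (maximum confusion), each log term evaluates to log 2, the equilibrium value of the original game.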
With the success of GANs and their popularization, deeper analyses have shown that these models may suffer from instability during training [12, 13], which can lead to lack of diversity and poor quality of the generated samples. In order to alleviate these issues, many GAN variations have been proposed in the last few years [11, 14, 15]. One interesting approach, proposed in [11], consists of using multiple discriminators, each of which takes as input a low-dimensional randomly projected version of the original input. The authors empirically showed that this method yields more stable training and provides more diverse, better-quality generated samples.
3 Proposed Model and Training
The proposed method relies on two main components: (i) a convolutional frame generator, $G_F$, and (ii) a recurrent model for generating videos, $G_V$. The goal is to disentangle the image-quality and temporal-coherence components of a video and to let each generative model individually focus on one of these two aspects. By doing so, the performance of the model relies on the capability of the frame generator to provide good and diverse images, as well as on the ability of the sequence generator to sequentially sample frames (i.e. navigate through the frame manifold induced by $G_F$) in a coherent order.
One of the main challenges in such an approach is to be able to train $G_F$ with enough diversity. Several methods have recently been proposed targeting mode dropping in the GAN setting [14]. In our experiments, we found the multiple-discriminator approach introduced in [11] to yield better stability during training, as well as higher sample quality and diversity. Training follows the usual steps, i.e. each discriminator is updated separately, but when updating the generator parameters, the average of the discriminator losses is considered. Thus, instead of using (2) as the generator loss during training, we use (3), namely:
$$\mathcal{L}_G = -\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{z \sim p_z}[\log D_k(G(z))] \quad (3)$$
where $D_k$ indicates the output of the $k$-th discriminator and $K$ the total number of discriminators. Training of $G_F$ was performed with several discriminators under this scheme. An architecture similar to DCGAN [2] was employed.
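The averaged loss in (3) can be sketched as follows, assuming each discriminator's outputs on a batch of generated samples are available as an array (function and variable names are illustrative):

```python
import numpy as np

def multi_discriminator_g_loss(d_outputs, eps=1e-8):
    # Eq. (3): average the non-saturating generator loss over the K discriminators.
    # d_outputs: list of K arrays, the k-th holding D_k(G(z)) for a batch of z's.
    per_disc = [-np.mean(np.log(dk + eps)) for dk in d_outputs]
    return np.mean(per_disc)
```

Note that each discriminator is still updated with its own loss (1); only the generator update averages across the ensemble.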
$G_V$ is composed of three main building blocks: an encoding stack of dense layers responsible for mapping a noise vector into a sequence of high-dimensional vectors; a bidirectional recurrent block that receives this sequence and computes a sequence of temporally dependent noise vectors, which are then used to sample from $G_F$; and, finally, for videos of length $T$, an output stage that samples $T$ times from the frame generator and orders the samples to form the final sequence. The described framework is represented in Fig. 1. The encoder (trapezoid) is parametrized by fully-connected (FC) layers, and the recurrent model by a two-layer bidirectional LSTM.
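The three blocks can be sketched as a PyTorch module. The layer sizes below are illustrative assumptions, not the values used here, and `frame_generator` stands for the pre-trained frame generator $G_F$:

```python
import torch
import torch.nn as nn

class VideoGenerator(nn.Module):
    """Sketch of G_V: FC encoder -> bidirectional LSTM -> per-step latents,
    each decoded to a frame by a pre-trained, frozen frame generator."""
    def __init__(self, frame_generator, noise_dim=100, hidden_dim=256, seq_len=30):
        super().__init__()
        self.seq_len = seq_len
        # Encoding stack: one noise vector -> sequence of high-dimensional vectors
        self.encoder = nn.Sequential(
            nn.Linear(noise_dim, hidden_dim * seq_len), nn.ReLU())
        # Two-layer bidirectional LSTM producing temporally dependent codes
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.to_z = nn.Linear(2 * hidden_dim, noise_dim)  # back to G_F's input size
        self.g = frame_generator
        for p in self.g.parameters():        # G_F stays fixed while training G_V
            p.requires_grad_(False)

    def forward(self, v):
        b = v.size(0)
        h = self.encoder(v).view(b, self.seq_len, -1)     # sequence of vectors
        h, _ = self.lstm(h)                               # temporal dependencies
        z = self.to_z(h)                                  # (b, T, noise_dim)
        frames = self.g(z.reshape(b * self.seq_len, -1))  # decode each step
        return frames.view(b, self.seq_len, *frames.shape[1:])
```

Gradients flow through the frozen $G_F$ into the recurrent block, so $G_V$ learns to traverse the frame manifold while $G_F$'s weights stay fixed.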
The scheme proposed in [11] was also used to train the sequence generator. In this case, we utilized 16 discriminators whose inputs are reduced-dimension random projections of each frame composing the input video. It is important to highlight that the parameters of $G_F$ are kept unchanged during the training of $G_V$. The architectures used for the video-generation GAN were: 1) Generator: FC; 2) Discriminator: similar to [2], but with 3D convolutions in place of 2D ones in order to take the temporal dimension into account. Random projections were implemented as convolutions with norm-1 filters.
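The random projections of [11] can be realized as fixed convolutions whose random filters are normalized to unit norm, one projection per discriminator; the kernel size, stride, and channel counts below are assumptions for illustration:

```python
import torch
import torch.nn as nn

def make_random_projection(in_ch=1, out_ch=1, kernel=3, seed=0):
    # A fixed (non-trainable) conv layer with a norm-1 random filter;
    # each of the discriminators owns one such projection of its input frame.
    torch.manual_seed(seed)
    proj = nn.Conv2d(in_ch, out_ch, kernel, stride=2, padding=1, bias=False)
    with torch.no_grad():
        proj.weight.copy_(proj.weight / proj.weight.norm())  # normalize to norm 1
    proj.weight.requires_grad_(False)  # projection stays fixed during training
    return proj
```

Because each discriminator sees a different low-dimensional view of the data, the generator must satisfy all of them at once, which is what stabilizes training in [11].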
4 Experimental Results
In order to evaluate the proposed method, we performed two main experiments. First, we aim to show that our approach is able to generate videos with both frame quality and temporal coherence. For that, we train the frame generator using frames from the same videos used for training the video generator. Second, our goal is to investigate what the video generator is learning in terms of navigation through the implicit manifold parametrized by $G_F$; to this end, we plot the 2-dimensional isomap [16] of the latent variables generated by the video generator in the first experiment. We built a training dataset composed of 100,000 samples from the bouncing balls data [17]. Each example consists of a 30-frame-long video with three bouncing balls. Randomly sampled frames from the same set of videos were used to train the frame generator in advance. The RMSprop optimizer was employed to train both $G_F$ and $G_V$. $G_F$ was trained for 50 epochs with minibatches of size 64, while $G_V$ was trained for 15 epochs with minibatches of size 8. The random seed was set to 10 before all experiments. A single NVIDIA GTX 1080Ti was used for training. A PyTorch [18] implementation is available on GitHub [19].

4.1 Generating videos
In Fig. 2, we show samples randomly drawn from the frame generator. By visual inspection, we notice that, as desired, good quality and diversity were obtained. Using this model as $G_F$, we train $G_V$ and show random samples in Fig. 3b. To provide a reference for comparison, we also show in Fig. 3a three randomly selected video samples drawn from the real data distribution. Each frame is plotted individually, with time increasing from left to right. Visual inspection of the generated frame sequences indicates that both the quality of individual frames (as ensured by the frame generator) and the temporal coherence are close to those of the original samples. More specifically, we notice that most transitions between frames are as smooth as in the original data samples.
We further highlight that the generated video samples are diverse, which suggests the proposed training scheme is effective in avoiding strong mode collapse. Nonetheless, failure cases still occur. For example, in the first row of Fig. 3b we observe an undesired non-smooth transition from the fourth-to-last to the third-to-last frame. We also notice that the temporal dynamics of the videos shown in the second and third rows of Fig. 3b are very similar, even though a frame-wise comparison shows that the videos are not the same. We refer to this effect as partial mode collapse and believe it could be mitigated by increasing the number of discriminators when training $G_V$; this is left for future study.
Moreover, temporal coherence was also studied in a case where $G_F$ and $G_V$ were trained on different datasets, namely bouncing balls with one and three balls, respectively. Notice that the one- and three-ball datasets have similar temporal dynamics, and no further training to fine-tune $G_V$ after replacing $G_F$ was performed. Three samples from the new video generator are shown in Fig. 4, from which one can notice that, even though the dynamics are not perfectly preserved in all frame transitions (in some cases a ball changes its trajectory without hitting a wall first), smooth transitions between frames are still maintained. This simple experiment indicates that the video generator is indeed able, to some extent, to learn the temporal dynamics independently, without focusing specifically on the content of each frame.
Finally, we objectively assessed smoothness by measuring the mean squared error (MSE) between consecutive frames. The average and standard deviation over 30 random samples drawn from the models trained on 3 and 1 bouncing balls are presented in Table 1. The same metrics are provided for real data and for videos obtained using a random sequence of latent variables, for comparison. For the 1-ball case, generated samples are as smooth as real videos. For the 3-ball case, in turn, the aforementioned occasional non-smooth transitions occur in sequences as long as 30 frames, as confirmed by the higher MSE, which is nonetheless much lower than that of random sequences.

4.2 Investigating what the frame generator is learning
We plot the isomap of the sequences of latent variables for 6 videos in order to investigate what the frame generator is learning. We included in this plot samples randomly drawn from the prior, with the aim of verifying whether $G_V$ is simply learning how to sample from the prior without any further knowledge. Another hypothesis we wanted to investigate is whether $G_V$ is learning to linearly interpolate latent variables. For that, we also plot in the isomap two sequences of latent variables obtained by linearly interpolating between two random samples from the prior. Results are shown in Fig. 5.

Table 1: MSE between consecutive frames (mean and standard deviation over 30 samples). "Random" denotes videos obtained from a random sequence of latent variables.

                       Mean     Std. dev.
3 balls   Real data    0.0222   0.0005
          Proposed     0.0735   0.0057
          Random       0.2060   0.0061
1 ball    Real data    0.0222   0.0005
          Proposed     0.0227   0.0051
          Random       0.0766   0.0023
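The smoothness metric of Table 1 can be reproduced with a short NumPy function; the array layout is our assumption, as the implementation details are not specified above:

```python
import numpy as np

def consecutive_frame_mse(video):
    # Mean squared error between consecutive frames.
    # video: array of shape (T, H, W), pixel values in [0, 1].
    diffs = np.diff(video.astype(np.float64), axis=0)
    return float(np.mean(diffs ** 2))
```

A perfectly static video scores 0; large jumps between frames, as produced by a random latent sequence, inflate the score.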
By observing the obtained plot, we notice that samples from the prior (green crosses) are spread across the plane, while linear interpolations (black triangles) are concentrated in particular regions. The sets of latent variables obtained with $G_V$ (circles, with different colors representing different videos), on the other hand, behave differently: in some cases, small clusters of $z$'s belonging to the same video are located in distant parts of the isomap, which leads us to conclude that the video generator learns to "jump" across the manifold defined by $G_F$ whenever necessary.
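The comparison in Fig. 5 can be sketched as follows, assuming scikit-learn's `Isomap` for the embedding; function names and array shapes are our own assumptions:

```python
import numpy as np
from sklearn.manifold import Isomap

def linear_interpolation(z_a, z_b, steps=30):
    # Baseline: a straight line in latent space between two prior samples.
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_a[None, :] + t * z_b[None, :]

def isomap_embedding(z_prior, z_interp, z_video, n_neighbors=10):
    # Jointly embed prior samples, interpolations, and the latent
    # sequences produced by the video generator in two dimensions.
    points = np.vstack([z_prior, z_interp, z_video])
    return Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(points)
```

Embedding all three point sets jointly, rather than separately, keeps their relative positions comparable in the resulting 2-D plane.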
5 Conclusions and future directions
We introduced a novel approach for unsupervised generation of temporal data using GANs. The method breaks the problem into frame generation and sequence generation and solves the two separately, thus making both tasks easier. Evaluation was performed on unsupervised video generation, and the generated video samples presented good per-frame quality and diversity as well as temporal coherence. This approach further provides indications regarding the structure of the implicit manifold parametrized by GANs, something that remains elusive in the literature. Visualization of the latent variables after dimensionality reduction via isomap indicates that the video manifold is not continuous, as latent representations corresponding to visually similar frames are not necessarily close in the isomap. This work opens directions for future research as a general scheme for generative modeling of time series. As such, we intend to apply the same approach to different domains. Further exploring the video generation setting and including other objective video quality metrics are additional targets of future investigation.
References
 [1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
 [2] Alec Radford, Luke Metz, and Soumith Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
 [3] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
 [4] Chris Donahue, Julian McAuley, and Miller Puckette, “Synthesizing audio with generative adversarial networks,” arXiv preprint arXiv:1802.04208, 2018.
 [5] Wilson Cai, Anish Doshi, and Rafael Valle, “Attacking speaker recognition with deep generative models,” arXiv preprint arXiv:1801.02384, 2018.
 [6] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba, “Generating videos with scene dynamics,” in Advances In Neural Information Processing Systems, 2016, pp. 613–621.

 [7] Masaki Saito, Eiichi Matsumoto, and Shunta Saito, "Temporal generative adversarial nets with singular value clipping," in IEEE International Conference on Computer Vision (ICCV), 2017, vol. 2, p. 5.
 [8] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz, "Mocogan: Decomposing motion and content for video generation," arXiv preprint arXiv:1707.04993, 2017.
 [9] Katsunori Ohnishi, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada, “Hierarchical video generation from orthogonal information: Optical flow and texture,” arXiv preprint arXiv:1711.09618, 2017.
 [10] Ian Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
 [11] Behnam Neyshabur, Srinadh Bhojanapalli, and Ayan Chakrabarti, “Stabilizing gan training with multiple random projections,” arXiv preprint arXiv:1705.07831, 2017.
 [12] David Berthelot, Thomas Schumm, and Luke Metz, “Began: boundary equilibrium generative adversarial networks,” arXiv preprint arXiv:1703.10717, 2017.
 [13] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
 [14] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh, “Pacgan: The power of two samples in generative adversarial networks,” arXiv preprint arXiv:1712.04086, 2017.
 [15] Abhay Yadav, Sohil Shah, Zheng Xu, David Jacobs, and Tom Goldstein, “Stabilizing adversarial nets with prediction methods,” 2018.
 [16] Joshua B Tenenbaum, Vin De Silva, and John C Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, 2000.

 [17] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov, "Unsupervised learning of video representations using lstms," in International Conference on Machine Learning, 2015, pp. 843–852.
 [18] "PyTorch," https://pytorch.org/docs/stable/index.html, 2018.
 [19] “FrameGAN Github Repository,” https://github.com/belaalb/frameGAN, 2018.