Generative adversarial networks (GANs) were recently introduced as an unsupervised approach to generative modeling, employing a game-theoretic training scheme to learn a probability distribution implicitly defined by training data. Under this setting, two models are trained jointly: the generator maps low-dimensional samples from a simple prior distribution to higher-dimensional structured data, while the discriminator tries to determine whether samples are genuine or generator outputs. To date, state-of-the-art results have been obtained for GAN-based generative modeling of images [2, 3] and of audio when image-like spectrogram representations are used [4, 5]. However, applications in other domains, such as temporal or discrete data, remain open problems under active investigation. Here, we direct our focus to adversarially learned video modeling.
A common strategy in recent attempts to train GANs for natural scene generation is to split the task into simpler parts. Vondrick et al., for instance, use independent modules for foreground and background modeling. In both the temporal GAN of Saito et al. and MoCoGAN, motion and frame content are learned by separate parts of the architecture, each designed specifically for one of those aspects. Ohnishi et al., in turn, tackle the problem by conditioning generation on optical flows provided a priori. In all such cases, however, even though the architectures are designed so that different components focus on different aspects of video generation, the components are trained jointly, which can introduce training difficulties such as mode collapse and divergence. These difficulties are aggravated by the high dimensionality of videos, which include a temporal component on top of the spatial ones.
In this work, we further exploit the idea of splitting the video generation process into smaller and simpler components. Frame content and motion are modeled by independent blocks within the complete model: a convolutional block responsible for mapping a low-dimensional vector into a frame, and a recurrent block that receives a fixed-dimension input vector and outputs a sequence of vectors to be used as inputs to the convolutional block, thus yielding a sequence of frames. Moreover, a new training scheme is devised on top of the proposed setting to avoid common issues faced when training GANs: each of the components mentioned above, i.e. the generative model of frames and the generative model of sequences of frames, is trained separately. More specifically, the multi-discriminator setting introduced by Neyshabur et al. is used in both steps to further stabilize training and produce diverse generators.
Under the described setting, the frames generator can be seen as a parametric representation of the manifold of video frames, i.e. a mapping from a much lower-dimensional space to actual frames. This model is trained first; once it provides good samples in terms of quality and diversity, the sequence component is trained. The sequence generator, in turn, is a recurrent model trained with the goal of learning how to effectively traverse the manifold induced by the pre-trained frames generator in a way that yields coherent frame sequences.
2 Generative Adversarial Networks
GANs are generally composed of a discriminator D: R^n -> [0, 1], where n is the dimensionality of the input space, and a generator G: R^m -> R^n, where m is the size of the input noise vector z. The discriminator receives either a sample x from the data distribution p_data or a generator output G(z), z ~ p_z, and during training its goal is to learn to tell these two types of input apart. The generator, on the other hand, aims at fooling the discriminator by learning how to produce samples as close to the data distribution as possible. GAN training was originally defined as a min-max game, but here we utilize the non-saturating game, as defined by Goodfellow. According to this training scheme, the discriminator loss (1) and the generator loss (2) are respectively defined as

L_D = -E_{x~p_data}[log D(x)] - E_{z~p_z}[log(1 - D(G(z)))],   (1)

L_G = -E_{z~p_z}[log D(G(z))].   (2)
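For concreteness, losses (1) and (2) can be computed directly from batches of discriminator outputs. The sketch below is ours (NumPy, with illustrative function names), not code from the paper:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Loss (1): d_real holds D(x) for real samples, d_fake holds D(G(z))."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Loss (2), the non-saturating generator loss: -E[log D(G(z))]."""
    return -np.mean(np.log(d_fake))
```

A well-trained discriminator pushes D(x) toward 1 and D(G(z)) toward 0, which drives both terms of (1) down, while the generator pushes D(G(z)) toward 1 to minimize (2).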
With the success and popularization of GANs, deeper analyses have shown that these models may suffer from instability during training [12, 13], which can lead to lack of diversity and poor quality of the generated samples. In order to alleviate these issues, many GAN variations were proposed in the last few years [11, 14, 15]. One interesting approach, proposed by Neyshabur et al., consists of using multiple discriminators, each of which receives as input a low-dimensional, randomly projected version of the original input. The authors empirically showed that this method yields more stable training and provides more diverse, better quality generated samples.
3 Proposed Model and Training
The proposed method relies on two main components: (i) a convolutional frame generator G_f, and (ii) a recurrent model G_v for generating videos. The goal is to disentangle the image-quality and temporal-coherence components of a video and let each generative model focus on one of these two aspects. By doing so, the performance of the overall model relies on the capability of the frame generator to provide good and diverse images, as well as on the sequence generator's ability to sample frames sequentially (i.e. navigate the frame manifold induced by G_f) in a coherent order.
One of the main challenges in such an approach is training the frame generator G_f with enough diversity. Several methods targeting mode dropping in the GAN setting have been proposed recently. In our experiments, we found the multiple-discriminators approach introduced by Neyshabur et al. to yield better stability during training, as well as higher sample quality and diversity. Training follows the usual steps, i.e. each discriminator is updated separately, but when updating the generator parameters, the average of the discriminator losses is considered. Thus, instead of using (2) as the generator loss during training, we use (3) instead, namely:
L_G = -(1/K) sum_{k=1..K} E_{z~p_z}[log D_k(G(z))],   (3)

where D_k indicates the output of the k-th discriminator and K the total number of discriminators. The frame generator G_f was trained under this multiple-discriminator scheme, with an architecture similar to DCGAN.
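In practice, (3) simply averages the non-saturating generator loss of (2) over the K discriminators. A minimal sketch (ours, not the authors' implementation):

```python
import numpy as np

def multi_discriminator_generator_loss(d_fake_outputs):
    """Eq. (3): average the per-discriminator non-saturating generator
    losses. d_fake_outputs is a list of K arrays, the k-th holding
    D_k(G(z)) for a batch of generated samples."""
    return np.mean([-np.mean(np.log(dk)) for dk in d_fake_outputs])
```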
The video generator G_v is composed of three main building blocks. An encoding stack of dense layers is responsible for mapping a noise vector into a sequence of high-dimensional vectors. This sequence is fed into a bi-directional recurrent block that computes a sequence of temporally dependent noise vectors, which are then used to sample from G_f. Finally, for videos of length T, the output is obtained by sampling T times from the frames generator and ordering the samples to form the final sequence of frames. The described framework is represented in Fig. 1: the encoder (trapezoid) is parametrized by fully-connected (FC) layers, and the recurrent model by a two-layer bi-directional LSTM.
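The three building blocks can be sketched in PyTorch roughly as follows. The class name, layer sizes, and dimensions are our illustrative assumptions, not the exact configuration used in the paper:

```python
import torch
import torch.nn as nn

class SequenceGenerator(nn.Module):
    """Sketch of the video generator G_v: an FC encoder maps one noise
    vector to T per-step vectors, a two-layer bi-directional LSTM makes
    them temporally dependent, and a linear head emits one latent z_t per
    frame, to be rendered by the (frozen) frame generator G_f."""
    def __init__(self, noise_dim=50, hidden=128, z_dim=100, seq_len=30):
        super().__init__()
        self.seq_len, self.hidden = seq_len, hidden
        self.encoder = nn.Sequential(nn.Linear(noise_dim, seq_len * hidden),
                                     nn.ReLU())
        self.rnn = nn.LSTM(hidden, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.to_z = nn.Linear(2 * hidden, z_dim)  # 2*hidden: both directions

    def forward(self, noise):  # noise: (batch, noise_dim)
        b = noise.size(0)
        h = self.encoder(noise).view(b, self.seq_len, self.hidden)
        h, _ = self.rnn(h)     # (batch, T, 2 * hidden)
        return self.to_z(h)    # (batch, T, z_dim): one latent per frame

# Rendering a video then amounts to applying G_f to each z_t and
# stacking the resulting T frames in order.
```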
The scheme proposed by Neyshabur et al. was also used to train the sequence generator. In this case, we utilized 16 discriminators, whose inputs are reduced-dimension random projections of each frame composing the input video. It is important to highlight that the parameters of the frame generator G_f are kept unchanged during the training of G_v. The architectures used for the video-generation GAN were: 1) generator: the FC encoder and bi-directional LSTM described above; 2) discriminator: similar to DCGAN's, but with 3D convolutions in place of 2D ones in order to take the temporal dimension into account. Random projections were implemented as convolutions with fixed, unit-norm random weights.
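A fixed random projection of this kind can be realized as a convolution whose weights are sampled once, normalized, and never updated. The PyTorch sketch below illustrates the idea; the class name, channel counts, kernel size, and stride are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomProjection(nn.Module):
    """Fixed random projection placed in front of a discriminator, in the
    spirit of Neyshabur et al.: each filter is drawn once, scaled to unit
    norm, and registered as a buffer so it is never trained."""
    def __init__(self, in_ch=3, out_ch=3, kernel=3):
        super().__init__()
        w = torch.randn(out_ch, in_ch, kernel, kernel)
        w = w / w.view(out_ch, -1).norm(dim=1).view(-1, 1, 1, 1)
        self.register_buffer("weight", w)  # excluded from the optimizer

    def forward(self, x):
        # Stride 2 also reduces the spatial dimension of the input.
        return F.conv2d(x, self.weight, stride=2, padding=1)
```

With one such module per discriminator, each of the 16 discriminators sees a different low-dimensional view of the same frame.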
4 Experimental Results
In order to evaluate the proposed method, we performed two main experiments. In the first, we aim to show that our approach is able to generate videos with both frame quality and temporal coherence; for that, we train the frame generator using frames from the same videos used for training the video generator. In the second, our goal is to investigate what the video generator learns in terms of navigation through the implicit manifold parametrized by G_f; to this end, we plot the 2-dimensional isomap of the latent variables produced by the video generator obtained in the first experiment. We built a training dataset composed of 100,000 samples from the bouncing balls data of Srivastava et al., where each example is a 30-frame video with three bouncing balls. Randomly sampled frames from the same set of videos were used to train the frames generator in advance. The RMSprop optimizer was employed to train G_f and G_v. G_f was trained for 50 epochs with mini-batches of size 64, while G_v was trained for 15 epochs with mini-batches of size 8. The random seed was set to 10 before all experiments. A single NVIDIA GTX 1080Ti was used for training. A PyTorch implementation is available on GitHub.
4.1 Generating videos
In Fig. 2, we show samples randomly drawn from the frame generator. Visual inspection indicates that, as desired, good quality and diversity were obtained. Using this model as the frame generator G_f, we train the video generator G_v and show random samples in Fig. 3b. To provide a reference for comparison, we also show in Fig. 3a three randomly selected video samples drawn from the real data distribution. Each frame is plotted individually, with time increasing from left to right. Visual inspection of the generated sequences indicates that both the quality of individual frames (as ensured by the frame generator) and their temporal coherence are close to those of the original samples. More specifically, most transitions between frames are as smooth as in the original data.
We further highlight that the generated video samples are diverse, which suggests the proposed training scheme is effective in avoiding strong mode collapse. Nonetheless, failure cases still occur. For example, in the first row of Fig. 3b we observed an undesired non-smooth transition from the fourth-to-last to the third-to-last frame. We also noticed that the temporal dynamics of the videos shown in the second and third rows of Fig. 3b are very similar, even though a frame-wise comparison shows that the videos are not the same. We refer to this effect as partial mode collapse and believe it could be mitigated by increasing the number of discriminators when training G_v; this is left for future study.
Moreover, temporal coherence was also studied in a setting where G_f and G_v were trained on different datasets, namely bouncing balls with 1 and 3 balls, respectively. Notice that the one- and three-ball datasets have similar temporal dynamics, and no fine-tuning of G_v was performed after replacing G_f. Three samples from the resulting video generator are shown in Fig. 4. Even though the dynamics are not perfectly preserved in all frame transitions, as in some cases a ball changes its trajectory without first hitting a wall, transitions between frames remain smooth. This simple experiment indicates that the video generator is indeed able, to some extent, to learn the temporal dynamics independently of the content of each frame.
Finally, we objectively assessed smoothness by measuring the mean-squared error (MSE) between consecutive frames. Averages and standard deviations over 30 random samples drawn from the models trained with 3 and 1 bouncing balls are presented in Table 1. The same metrics are provided for real data and for videos obtained from a random sequence of latent variables, for comparison. For the 1-ball case, generated samples are as smooth as real videos. For the 3-ball case, in turn, the aforementioned occasional non-smooth transitions do appear in sequences as long as 30 frames, as confirmed by the higher MSE, which is nevertheless much lower than that of random sequences.
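This smoothness measure is simple to reproduce; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def consecutive_frame_mse(video):
    """Per-transition MSE between consecutive frames of a (T, H, W) video;
    returns the mean and standard deviation over the T-1 transitions."""
    diffs = [np.mean((video[t + 1] - video[t]) ** 2)
             for t in range(len(video) - 1)]
    return float(np.mean(diffs)), float(np.std(diffs))
```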
4.2 Investigating what the frame generator is learning
We plot the isomap of the sequences of latent variables for six videos in order to investigate what the frames generator is learning. We include in this plot samples randomly drawn from the prior, with the aim of verifying whether G_v is simply learning to sample from the prior without any further knowledge. Another hypothesis we investigate is whether G_v is learning to linearly interpolate between latent variables; for that, we also plot two sequences of latent variables obtained by linearly interpolating two random samples from the prior. Results are shown in Fig. 5.
Table 1: Average and standard deviation of the MSE between consecutive frames.

| Data    | Model     | Avg. MSE | Std.   |
|---------|-----------|----------|--------|
| 3 balls | Real data | 0.0222   | 0.0005 |
| 1 ball  | Real data | 0.0222   | 0.0005 |
Observing the obtained plot, we notice that samples from the prior (green crosses) are spread across the plane, while linear interpolations (black triangles) are concentrated in particular regions of it. The sets of latent variables produced by the video generator G_v (circles; different colors represent different videos), on the other hand, behave differently: in some cases, small clusters of latent variables belonging to the same video are located in distant parts of the isomap, which leads us to conclude that the video generator learns to "jump" across the manifold defined by G_f whenever necessary.
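The embedding itself can be obtained with an off-the-shelf isomap implementation such as scikit-learn's. The sketch below (the function name and neighborhood size are our assumptions) embeds a pooled set of latent vectors, e.g. the per-frame latents of the generated videos together with prior samples and interpolations:

```python
import numpy as np
from sklearn.manifold import Isomap

def embed_latents_2d(latents, n_neighbors=10):
    """2-D isomap embedding of an (N, z_dim) array of latent vectors."""
    return Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(latents)
```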
5 Conclusions and future directions
We introduced a novel approach for unsupervised generation of temporal data using GANs. The method breaks the problem into frame generation and sequence generation and solves them separately, thus making both tasks easier. Evaluation was performed on unsupervised video generation, and the generated samples presented good per-frame quality and diversity as well as temporal coherence. The approach further provides indications regarding the structure of the implicit manifold parametrized by GANs, something that remains elusive in the literature: visualization of latent variables after dimensionality reduction via isomap indicates that the video manifold is not continuous, as latent representations corresponding to visually similar frames are not necessarily close in the isomap. This work opens directions for future research on general schemes for generative modeling of time series. As such, we intend to apply the same approach to different domains. Further exploiting the video generation setting and including other objective video-quality metrics are additional targets of future investigation.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
-  Alec Radford, Luke Metz, and Soumith Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
-  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
-  Chris Donahue, Julian McAuley, and Miller Puckette, “Synthesizing audio with generative adversarial networks,” arXiv preprint arXiv:1802.04208, 2018.
-  Wilson Cai, Anish Doshi, and Rafael Valle, “Attacking speaker recognition with deep generative models,” arXiv preprint arXiv:1801.02384, 2018.
-  Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba, “Generating videos with scene dynamics,” in Advances In Neural Information Processing Systems, 2016, pp. 613–621.
-  Masaki Saito, Eiichi Matsumoto, and Shunta Saito, “Temporal generative adversarial nets with singular value clipping,” in IEEE International Conference on Computer Vision (ICCV), 2017, vol. 2, p. 5.
-  Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz, “Mocogan: Decomposing motion and content for video generation,” arXiv preprint arXiv:1707.04993, 2017.
-  Katsunori Ohnishi, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada, “Hierarchical video generation from orthogonal information: Optical flow and texture,” arXiv preprint arXiv:1711.09618, 2017.
-  Ian Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
-  Behnam Neyshabur, Srinadh Bhojanapalli, and Ayan Chakrabarti, “Stabilizing gan training with multiple random projections,” arXiv preprint arXiv:1705.07831, 2017.
-  David Berthelot, Thomas Schumm, and Luke Metz, “Began: boundary equilibrium generative adversarial networks,” arXiv preprint arXiv:1703.10717, 2017.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
-  Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh, “Pacgan: The power of two samples in generative adversarial networks,” arXiv preprint arXiv:1712.04086, 2017.
-  Abhay Yadav, Sohil Shah, Zheng Xu, David Jacobs, and Tom Goldstein, “Stabilizing adversarial nets with prediction methods,” 2018.
-  Joshua B Tenenbaum, Vin De Silva, and John C Langford, “A global geometric framework for nonlinear dimensionality reduction,” science, vol. 290, no. 5500, pp. 2319–2323, 2000.
-  Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov, “Unsupervised learning of video representations using lstms,” in International conference on machine learning, 2015, pp. 843–852.
-  “Pytorch,” https://pytorch.org/docs/stable/index.html, 2018.
-  “FrameGAN Github Repository,” https://github.com/belaalb/frameGAN, 2018.