Learning to navigate image manifolds induced by generative adversarial networks for unsupervised video generation

01/23/2019, by Isabela Albuquerque, et al.

In this work, we introduce a two-step framework for generative modeling of temporal data. Specifically, the generative adversarial networks (GANs) setting is employed to generate synthetic scenes of moving objects. To do so, we propose a two-step training scheme in which a generator of static frames is trained first. Afterwards, a recurrent model is trained with the goal of providing a sequence of inputs to the previously trained frames generator, thus yielding scenes that look natural. The adversarial setting is employed in both training steps. However, with the aim of avoiding known training instabilities in GANs, a multiple-discriminator approach is used to train both models. Results on the studied video dataset indicate that, by employing such an approach, the recurrent part is able to learn how to coherently navigate the image manifold induced by the frames generator, thus yielding more natural-looking scenes.


1 Introduction

Generative adversarial networks (GANs) [1]

were recently introduced as an unsupervised approach to generative modeling, employing a game-theoretic training scheme in order to learn a probability density implicitly defined by the training data. Under this setting, two models are trained jointly. The generator tries to map low-dimensional samples from some simple prior distribution to higher-dimensional structured data, while the discriminator tries to determine whether samples are genuine or generator outputs. To date, state-of-the-art results have been obtained for GAN-based generative modeling of images

[2, 3] and audio, if image-like spectrogram representations are used [4, 5]. However, their applications in other domains, such as temporal or discrete data, remain open problems under active investigation. Here, we direct our focus to adversarially learned video modeling.

A common strategy in recent attempts at training GANs for natural scene generation is to split the task into simpler parts. In [6], for instance, there are independent modules for foreground and background modeling. In both [7] and [8], motion and frame content are learned by different parts of the architecture, designed specifically for each of those aspects. In turn, in [9] the authors tackle the problem by conditioning generation on optical flows provided a priori. However, in all such cases, even though the model architectures are designed to focus on different aspects of video generation, training is performed jointly, which might yield relevant training difficulties such as mode collapse and divergence [10]. This is due in part to the high dimensionality of videos, which also include a temporal component.

In this work, we further exploit the idea of splitting the video generation process into smaller and simpler components. Frame content and motion modeling are achieved by independent blocks within the complete model: a convolutional block responsible for mapping a low-dimensional vector into a frame, and a recurrent block intended to receive a fixed-dimension vector as input and to output a sequence of vectors to be used as inputs to the convolutional block, thus yielding a sequence of frames. Moreover, a new training scheme is devised on top of the proposed setting to avoid common issues faced when training GANs. Each of the above-mentioned components, i.e. the generative model of frames as well as the generative model of sequences of frames, is trained separately. More specifically, the multi-discriminator setting introduced in

[11] is used in both steps to further stabilize training and produce diverse generators.

Under the described setting, the frames generator can be seen as a parametric representation of the manifold of video frames, i.e. a mapping from a much lower-dimensional space to actual frames. This model is trained first and, once we obtain good samples in terms of quality and diversity, the sequence component is trained. The sequence generator, in turn, is a recurrent model trained with the goal of learning how to effectively traverse the manifold induced by the pre-trained frames generator in a way that yields coherent frame sequences.

The remainder of this paper is organized as follows. In Section 2 we briefly review Generative Adversarial Networks training. In Section 3 we describe the proposed approach and provide experiments to validate it in Section 4. We provide conclusions and future research directions in Section 5.

2 Generative Adversarial Networks

GANs are generally composed of a discriminator model $D: \mathbb{R}^d \rightarrow [0, 1]$, where $d$ is the dimensionality of the input space, and a generator $G: \mathbb{R}^n \rightarrow \mathbb{R}^d$, where $n$ is the size of an input noise vector $z \sim p_z$. $D$ receives either a sample $x$ from the data distribution $p_{\text{data}}$ or a sample from the generator, $G(z)$. During training, its goal is to learn how to tell apart these two types of inputs. The generator, on the other hand, aims at fooling the discriminator by learning how to produce samples as close to the data distribution as possible. GAN training was originally defined as a min-max game, but here we utilize the non-saturating game, as defined in [10]. According to this training scheme, the discriminator loss $\mathcal{L}_D$ and the generator loss $\mathcal{L}_G$ are respectively defined as

$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \qquad (1)$$
$$\mathcal{L}_G = -\mathbb{E}_{z \sim p_z}[\log D(G(z))]. \qquad (2)$$
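For concreteness, a minimal PyTorch sketch of these two losses is given below; it assumes a discriminator that outputs probabilities through a final sigmoid, and the names netD and netG are illustrative placeholders rather than identifiers from the released implementation.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(netD, netG, x_real, z):
    # L_D = -E[log D(x)] - E[log(1 - D(G(z)))], cf. Eq. (1)
    d_real = netD(x_real)
    d_fake = netD(netG(z).detach())  # detach: no gradients flow into the generator
    loss_real = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    loss_fake = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return loss_real + loss_fake

def generator_loss(netD, netG, z):
    # Non-saturating game: L_G = -E[log D(G(z))], cf. Eq. (2)
    d_fake = netD(netG(z))
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
```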

With the success of GANs and their popularization, deeper analyses have shown that these models may suffer from instability during training [12, 13], which can lead to lack of diversity and poor quality of the generated samples. In order to alleviate these issues, many GAN variations have been proposed in the last few years [11, 14, 15]. One interesting approach, proposed in [11], consists in using multiple discriminators, each of which takes as input a low-dimensional randomly projected version of the original input. The authors empirically showed that this method yields more stable training and provides more diverse and better quality generated samples.

3 Proposed Model and Training

The proposed method relies on two main components: (i) a convolutional frame generator $G_f$, and (ii) a recurrent model $G_v$ for generating videos. The goal is to disentangle the image-quality and temporal-coherence components of a video and let each of the generative models individually focus on one of these two aspects. By doing so, the performance of the model relies on the capability of the frame generator to provide good and diverse images, as well as on the sequence generator being able to sequentially sample frames (i.e. navigate through the frames manifold induced by $G_f$) in a coherent order.

One of the main challenges in such an approach is to be able to train $G_f$ with enough diversity. Several methods have been proposed recently targeting mode dropping in the GAN setting [14]. In our experiments, we found the multiple-discriminator approach introduced in [11] to yield better stability during training, as well as higher sample quality and diversity. Training follows the usual steps, i.e. each discriminator is updated separately, but when updating the generator parameters, the average of the discriminator losses is considered. Thus, instead of (2), we use (3) as the generator loss during training, namely:

$$\mathcal{L}_G = -\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}_{z \sim p_z}[\log D_k(G(z))], \qquad (3)$$

where $D_k$ indicates the output of the $k$-th discriminator and $K$ the total number of discriminators. Training of $G_f$ was performed with multiple discriminators. An architecture similar to DCGAN [2] was employed.
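A possible PyTorch sketch of the averaged generator objective in (3) is shown below; the list-of-discriminators interface and the assumption that each discriminator outputs a probability are ours, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def multi_discriminator_generator_loss(discriminators, netG, z):
    """Generator loss averaged over K discriminators, cf. Eq. (3).

    `discriminators` is a list of K models; each one receives (a projection of)
    the generated sample and outputs the probability of its input being real.
    """
    fake = netG(z)
    losses = []
    for netD in discriminators:
        d_fake = netD(fake)
        # non-saturating loss: push every discriminator towards the "real" label
        losses.append(F.binary_cross_entropy(d_fake, torch.ones_like(d_fake)))
    return torch.stack(losses).mean()
```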

$G_v$ is composed of three main building blocks: an encoding stack of dense layers responsible for mapping a noise vector into a sequence of high-dimensional vectors; a bi-directional recurrent block that receives this sequence and computes a sequence of temporally dependent latent vectors, which are then used to sample from $G_f$; and, finally, for videos of length $T$, an output obtained by sampling $T$ times from the frames generator and ordering the samples to form the final sequence of frames. The described framework is represented in Fig. 1. The encoder (trapezoid) is parametrized by fully-connected (FC) layers, and the recurrent model by a two-layer bi-directional LSTM.

Figure 1: Graphical representation of the video generator. Pink blocks represent the pre-trained frame generator.
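As an illustration of this architecture, a sketch along the lines of Fig. 1 is given below; all layer sizes, the latent dimensionality, and the class and argument names are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class VideoGenerator(nn.Module):
    """Sketch of the recurrent video generator G_v (sizes are illustrative).

    A stack of fully-connected layers maps a single noise vector to a sequence
    of T hidden vectors, a two-layer bidirectional LSTM turns them into
    temporally dependent latent codes, and a frozen, pre-trained frame
    generator G_f maps each latent code to a frame.
    """
    def __init__(self, frame_generator, noise_dim=100, latent_dim=100,
                 hidden_dim=256, seq_len=30):
        super().__init__()
        self.seq_len = seq_len
        self.encoder = nn.Sequential(
            nn.Linear(noise_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, seq_len * hidden_dim), nn.ReLU(),
        )
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.to_latent = nn.Linear(2 * hidden_dim, latent_dim)
        self.frame_generator = frame_generator  # pre-trained G_f, kept frozen
        for p in self.frame_generator.parameters():
            p.requires_grad = False

    def forward(self, z):
        batch = z.size(0)
        h = self.encoder(z).view(batch, self.seq_len, -1)  # (B, T, hidden)
        h, _ = self.rnn(h)                                  # (B, T, 2*hidden)
        latents = self.to_latent(h)                         # (B, T, latent)
        frames = [self.frame_generator(latents[:, t]) for t in range(self.seq_len)]
        return torch.stack(frames, dim=1)                   # (B, T, C, H, W)
```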

The scheme proposed in [11] was also used to train the sequence generator. In this case, we utilized 16 discriminators whose inputs are reduced-dimension random projections of each frame composing the input video. It is important to highlight that the parameters of $G_f$ are kept unchanged during the training of $G_v$. The architectures used for the video generation GAN were: 1) generator: FC; 2) discriminator: similar to [2], but with 3D convolutions in place of 2D ones in order to take the temporal dimension into account. Random projections were implemented as norm-constrained convolutions.
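One possible way to realize such a fixed random projection as a convolution is sketched below; the kernel size, stride, and unit-norm normalization are assumptions for illustration, not the exact settings of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomProjection(nn.Module):
    """Fixed (non-trainable) random convolutional projection of a frame.

    Each of the K discriminators sees the input through its own projection,
    reducing dimensionality before the discriminator proper.
    """
    def __init__(self, in_channels=1, out_channels=8, kernel_size=8, stride=4):
        super().__init__()
        weight = torch.randn(out_channels, in_channels, kernel_size, kernel_size)
        # normalize each random filter to unit L2 norm
        weight = weight / weight.view(out_channels, -1).norm(dim=1).view(-1, 1, 1, 1)
        self.register_buffer("weight", weight)  # buffer: never updated by the optimizer
        self.stride = stride

    def forward(self, x):
        return F.conv2d(x, self.weight, stride=self.stride)
```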

4 Experimental Results

In order to evaluate the proposed method, we performed two main experiments. First, we aim to show that our approach is able to generate videos with both frame quality and temporal coherence; for that, we train the frame generator using frames from the same videos used for training the video generator. Second, we investigate what the video generator is learning in terms of navigating the implicit manifold parametrized by $G_f$; to this end, we plot the two-dimensional isomap [16] of the latent variables produced by the video generator obtained in the first experiment.

We built a training dataset composed of 100,000 samples from the bouncing balls data [17]. Each example consists of a 30-frame video with three bouncing balls. Randomly sampled frames from the same set of videos were used to train the frames generator in advance. The RMSprop optimizer was employed to train both $G_f$ and $G_v$. $G_f$ was trained for 50 epochs with mini-batches of size 64, while $G_v$ was trained for 15 epochs with mini-batches of size 8. The random seed was set to 10 before all experiments. A single NVIDIA GTX 1080Ti was used for training. A PyTorch [18] implementation is available on Github [19].

4.1 Generating videos

In Fig. 2, we show samples randomly drawn from the frame generator. By visual inspection, we notice that, as desired, good quality and diversity were obtained. Using this model as $G_f$, we train $G_v$ and show random samples in Fig. 3b. To provide a reference for comparison, we also show in Fig. 3a three randomly selected video samples drawn from the real data distribution. Each frame is plotted individually such that time increases from left to right. Visual inspection of the generated sequences of frames indicates that both the quality of individual frames (as ensured by the frame generator) and the temporal coherence were close to those of the original samples. More specifically, we notice that most of the transitions between frames are as smooth as in the original data samples.

We further highlight that the generated video samples are diverse, which suggests the proposed training scheme is effective in avoiding strong mode collapse. Nonetheless, failure cases do still occur. For example, in the first row of Fig. 3b we observed an undesired non-smooth transition from the fourth-to-last to the third-to-last frame. We also noticed that the temporal dynamics of the videos shown in the second and third rows of Fig. 3b are very similar, even though a frame-wise comparison shows that the videos are not the same. We refer to this effect as partial mode collapse and believe it could be mitigated by increasing the number of discriminators when training $G_v$; this is left for future study.

Figure 2: Random samples from the frame generator.

Moreover, time coherence was also studied in a case where $G_f$ and $G_v$ were trained using different datasets, namely bouncing balls with one and three balls, respectively. Notice that the one- and three-ball datasets have similar temporal dynamics, and no further training was executed to fine-tune $G_v$ after replacing $G_f$. Three samples from the new video generator are shown in Fig. 4, from which one can notice that, even though the dynamics are not perfectly preserved in all frame transitions, as in some cases the ball changes its trajectory without hitting a wall first, smooth transitions between frames are still maintained. This simple experiment indicates that the video generator is indeed able to, to some extent, learn the temporal dynamics independently, without specifically focusing on the content of each frame.

Finally, we objectively assessed smoothness by measuring the mean-squared error (MSE) between consecutive frames. Averages and standard deviations over 30 random samples drawn from the models obtained using three and one bouncing balls are presented in Table 1. The same metrics are provided for real data and for videos obtained using a random sequence of latent variables, for comparison. For the one-ball case, generated samples are as smooth as real videos. For the three-ball case, in turn, the aforementioned occasional non-smooth transitions do happen within sequences as long as 30 frames, as confirmed by the higher MSE, which is nonetheless much lower than that of random sequences.
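For reference, a small sketch of how this per-video smoothness score could be computed is given below; the (batch, time, channels, height, width) tensor layout and the helper name are assumptions.

```python
import torch

def consecutive_frame_mse(videos):
    """Mean squared error between consecutive frames.

    `videos` is assumed to have shape (batch, time, channels, height, width)
    with pixel values in [0, 1]. Returns one smoothness score per video.
    """
    diffs = videos[:, 1:] - videos[:, :-1]          # (B, T-1, C, H, W)
    return diffs.pow(2).mean(dim=(1, 2, 3, 4))      # average over frames and pixels

# Example: report mean and standard deviation over 30 sampled videos.
if __name__ == "__main__":
    fake_videos = torch.rand(30, 30, 1, 64, 64)     # stand-in for generator samples
    scores = consecutive_frame_mse(fake_videos)
    print(scores.mean().item(), scores.std().item())
```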

4.2 Investigating what the frame generator is learning

We plot the isomap of the sequences of latent variables for 6 videos in order to investigate what the frames generator is learning. We included in this plot samples randomly drawn from the prior, with the aim of verifying whether $G_v$ is simply learning how to sample from the prior without any further knowledge. Another hypothesis we wanted to investigate is whether $G_v$ is learning to linearly interpolate latent variables. For that, we also plot in the isomap two sequences of latent variables obtained by linearly interpolating two random samples from the prior. Results are shown in Fig. 5.

(a) Samples from the training data.
(b) Samples of videos generated by the proposed approach.
Figure 3: Real data (a) and generated video samples (b). $G_f$ and $G_v$ were both trained using the three bouncing balls dataset. Time increases from left to right.
Figure 4: Video generator samples for $G_f$ trained with one bouncing ball and $G_v$ trained with three bouncing balls.
                              Mean     Std. dev.
3 balls   Real data           0.0222   0.0005
          Proposed            0.0735   0.0057
          Random sequence     0.2060   0.0061
1 ball    Real data           0.0222   0.0005
          Proposed            0.0227   0.0051
          Random sequence     0.0766   0.0023
Table 1: MSE between consecutive frames.
Figure 5: Two-dimensional isomap obtained by plotting the latent variables of 6 generated videos (circles, different colors represent different videos), samples from the prior (green crosses), and linear interpolations (black triangles).

By observing the obtained plot, we notice that samples from the prior (green crosses) are spread across the plane, while linear interpolations (black triangles) are concentrated in particular regions of the plane. The sets of latent variables obtained with $G_v$ (circles, where different colors represent different videos), on the other hand, seem to behave differently. In some cases, small clusters of latent variables belonging to the same video are located in different parts of the isomap, which leads us to conclude that the video generator learns to "jump" across the manifold defined by $G_f$ whenever necessary.
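A sketch of how such a visualization could be produced with scikit-learn's Isomap is given below; the array shapes and the random stand-in data are purely illustrative of the procedure, not the actual latent vectors analyzed in Fig. 5.

```python
import numpy as np
from sklearn.manifold import Isomap

# Stand-ins for the three groups of latent vectors compared in Fig. 5:
# per-frame latents produced by the video generator, samples from the prior,
# and a linear interpolation between two prior samples.
video_latents = np.random.randn(6 * 30, 100)   # 6 videos x 30 frames, latent dim 100
prior_samples = np.random.randn(180, 100)
a, b = np.random.randn(100), np.random.randn(100)
interpolation = np.array([(1 - t) * a + t * b for t in np.linspace(0, 1, 30)])

all_points = np.concatenate([video_latents, prior_samples, interpolation], axis=0)
embedding = Isomap(n_components=2).fit_transform(all_points)  # 2-D isomap coordinates

# The first block of rows corresponds to the video-generator latents,
# the next to prior samples, and the last to the interpolated path.
```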

5 Conclusions and future directions

We introduced a novel approach for the unsupervised generation of temporal data using GANs. The method breaks the problem into frame generation and sequence generation and solves them separately, thus making both tasks easier. Evaluation is performed on unsupervised video generation, and the generated video samples presented good per-frame quality and diversity as well as temporal coherence. This approach further provides indications regarding the structure of the implicit manifold parametrized by GANs, something that still remains elusive in the literature. Visualization of latent variables after dimensionality reduction via isomap indicates that the video manifold is not continuous, as latent representations corresponding to visually similar frames are not necessarily close in the isomap. This work opens directions for future research as a general scheme for generative modeling of time series. As such, we intend to apply the same approach to different domains. Further exploring the video generation setting and including other objective video quality metrics is another target of future investigation.

References