1 Introduction
Exponential progress in the capabilities of computational hardware, paired with a relentless effort towards greater insights and better methods, has pushed the field of machine learning from relative obscurity into the mainstream. Progress in the field has translated to improvements in various capabilities, such as classification of images (Krizhevsky et al., 2012), machine translation (Vaswani et al., 2017) and super-human game-playing agents (Mnih et al., 2013; Silver et al., 2017), among others. However, the application of machine learning technology has been largely constrained to situations where large amounts of supervision are available, such as in image classification or machine translation, or where highly accurate simulations of the environment are available to the learning agent, such as in game-playing agents.
An appealing alternative to supervised learning is to utilize large unlabeled datasets, combined with predictive generative models. In order for a complex generative model to be able to effectively predict future events, it must build up an internal representation of the world. For example, a predictive generative model that can predict future frames in a video would need to model complex real-world phenomena, such as physical interactions. This provides an appealing mechanism for building models that have a rich understanding of the physical world, without any labeled examples. Videos of real-world interactions are plentiful and readily available, and a large generative model can be trained on large unlabeled datasets containing many video sequences, thereby learning about a wide range of real-world phenomena. Such a model could be useful for learning representations for further downstream tasks (Mathieu et al., 2016), or could even be used directly in applications where predicting the future enables effective decision making and control, such as robotics (Finn et al., 2016).
A central challenge in video prediction is that the future is highly uncertain: a short sequence of observations of the present can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally (as in the case of pixel-level autoregressive models), or do not directly optimize the likelihood of the data.
In this paper, we study the problem of stochastic prediction, focusing specifically on the case of conditional video prediction: synthesizing raw RGB video frames conditioned on a short context of past observations (Ranzato et al., 2014; Srivastava et al., 2015; Vondrick et al., 2015; Xingjian et al., 2015; Boots et al., 2014). Specifically, we propose a new class of video prediction models that can provide exact likelihoods, generate diverse stochastic futures, and accurately synthesize realistic and high-quality video frames. The main idea behind our approach is to extend flow-based generative models (Dinh et al., 2014, 2016) into the setting of conditional video prediction. Although methods based on variational autoencoders (Babaeizadeh et al., 2017; Denton & Fergus, 2018; Lee et al., 2018) and pixel-level autoregressive models (Hochreiter & Schmidhuber, 1997; Graves, 2013; van den Oord et al., 2016b, c; Van Den Oord et al., 2016) have previously been studied for stochastic predictive generation, flow-based models have received much less attention, and to our knowledge have been applied only to generation of non-temporal data, such as images (Kingma & Dhariwal, 2018), and to audio sequences (Prenger et al., 2018). Conditional generation of videos presents its own unique challenges: the high dimensionality of video sequences makes them difficult to model as individual datapoints. Instead, we learn a latent dynamical system model that predicts future values of the flow model’s latent state. This induces Markovian dynamics on the latent state of the system, replacing the standard unconditional prior distribution. We further describe a practically applicable architecture for flow-based video prediction models, inspired by the Glow model for image generation (Kingma & Dhariwal, 2018), which we call VideoFlow.
Our empirical results show that VideoFlow achieves results that are competitive with the state-of-the-art in stochastic video prediction on the action-free BAIR dataset, with quantitative results that rival the best VAE-based models. VideoFlow also produces excellent qualitative results, and avoids many of the common artifacts of models that use pixel-level mean-squared-error for training (e.g., blurry predictions), without the challenges associated with training adversarial models. Compared to models based on pixel-level autoregressive prediction, VideoFlow achieves substantially faster test-time image synthesis (we generate 64x64 videos of 20 frames in less than 3.5 seconds on an NVIDIA P100 GPU), making it much more practical for applications that require real-time prediction, such as robotic control (Finn & Levine, 2017). Finally, since VideoFlow directly optimizes the likelihood of training videos, without relying on a variational lower bound, we can evaluate its performance directly in terms of likelihood values.
2 Related Work
Early work on prediction of future video frames focused on deterministic predictive models (Ranzato et al., 2014; Srivastava et al., 2015; Vondrick et al., 2015; Xingjian et al., 2015; Boots et al., 2014). Much of this research on deterministic models focused on architectural changes, such as incorporating pixel transformations (Finn et al., 2016; De Brabandere et al., 2016; Liu et al., 2017) and predictive coding architectures (Lotter et al., 2017), as well as different generation objectives (Mathieu et al., 2016; Vondrick & Torralba, 2017; Walker et al., 2015). With models that can successfully model many deterministic environments, the next key challenge is to address stochastic environments by building models that can effectively reason over uncertain futures. Real-world videos are always somewhat stochastic, either due to events that are inherently random, or events that are caused by unobserved or partially observable factors, such as off-screen events, humans and animals with unknown intentions, and objects with unknown physical properties. In such cases, since deterministic models can only generate one future, these models either disregard potential futures or produce blurry predictions that are the superposition or averages of possible futures.
A variety of methods have sought to overcome this challenge by incorporating stochasticity, via three types of approaches: models based on variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014), generative adversarial networks (Goodfellow et al., 2014), and autoregressive models (Hochreiter & Schmidhuber, 1997; Graves, 2013; van den Oord et al., 2016b, c; Van Den Oord et al., 2016). Among these models, techniques based on variational autoencoders have been explored most widely (Babaeizadeh et al., 2017; Denton & Fergus, 2018; Lee et al., 2018). These models use latent random variables to represent stochastic events. They are trained by maximizing the evidence lower bound using an inference network, which estimates the posterior distribution over these latent variables and is typically conditioned on the current or future frames, such that the whole model resembles a sequence-level autoencoder. Unlike our proposed method, none of these models maximize the log-likelihood directly, since they rely on optimizing the evidence lower bound.
To our knowledge, the only prior class of video prediction models that directly maximize the log-likelihood of the data are autoregressive models (Hochreiter & Schmidhuber, 1997; Graves, 2013; van den Oord et al., 2016b, c; Van Den Oord et al., 2016), which can be applied to model the joint distribution of raw video pixels by generating the video one pixel at a time (Kalchbrenner et al., 2017). However, synthesis with such models is inherently sequential, which makes it inefficient on modern parallel hardware. Prior work has aimed to speed up training and synthesis with such autoregressive models (Reed et al., 2017; Ramachandran et al., 2017). However, Babaeizadeh et al. (2017) show that the predictions from these models are sharp but noisy and that the proposed VAE model produces substantially better predictions, especially for longer horizons. In contrast to autoregressive models, we find that our proposed method exhibits faster sampling, while still directly optimizing the log-likelihood and producing high-quality long-term predictions.
3 Preliminaries: Flow-Based Generative Models
Flow-based generative models (Dinh et al., 2014, 2016) have received comparatively little attention in the research community. However, these models have a unique set of advantages: exact latent-variable inference, exact log-likelihood evaluation, and efficiency in terms of both inference and synthesis. The basic principles behind flow-based generative models were first described by Deco & Brauer (1995), but were rediscovered and more fully developed in a modern context by Dinh et al. (2014) as Nonlinear Independent Component Estimation (NICE), with further refinements and extensions proposed by Dinh et al. (2016) (RealNVP). To our knowledge, prior work has only applied such models to generate static images (Kingma & Dhariwal, 2018) or sound (Prenger et al., 2018), while we propose a dynamics-enabled normalizing flow model in our work. Here, we first summarize the foundations of modern normalizing flow models.
Let $\mathbf{x} \sim p^*(\mathbf{x})$ be our dataset of i.i.d. observations of a random variable $\mathbf{x}$ with an unknown true distribution $p^*(\mathbf{x})$. Our data consist of 8-bit videos, with each dimension rescaled to the domain $[0, 255/256]$. We add a small amount of uniform noise to the data, $u \sim \mathcal{U}(0, 1/256)$, matching its discretization level (Dinh et al., 2016; Kingma & Dhariwal, 2018). Let $q(\mathbf{x})$ be the resulting empirical distribution corresponding to this scaling and addition of noise. Note that additive noise is required to prevent $q(\mathbf{x})$ from having infinite densities at the datapoints, which can result in ill-behaved optimization of the log-likelihood; it also allows us to recast maximization of the log-likelihood as minimization of a KL divergence.
Let $p_\theta(\mathbf{x})$ be our model of the data with parameters $\theta$. Maximization of the log-likelihood w.r.t. $\theta$ is then equivalent to minimization of the KL divergence w.r.t. the parameters $\theta$: $D_{KL}(q(\mathbf{x}) \,\|\, p_\theta(\mathbf{x}))$. This objective measures the expected per-datapoint compression cost in nats or bits (depending on the base of the logarithm); see Dinh et al. (2016).
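The preprocessing and the unit of the objective can be illustrated with a small sketch. It is not taken from the VideoFlow code: the function names `dequantize` and `bits_per_dim` are hypothetical, and the conversion shown is the common convention for dequantized 8-bit data, which we assume (but cannot confirm from the text) matches the bits-per-pixel numbers reported later.

```python
import math
import random


def dequantize(pixels, rng, levels=256):
    """Rescale integer pixels in [0, levels - 1] and add uniform noise.

    Each value becomes (p + u) / levels with u ~ U(0, 1): the rescaled pixel
    plus noise of width 1 / levels, so the result lies in [0, 1).
    """
    return [(p + rng.random()) / levels for p in pixels]


def bits_per_dim(log_density_nats, num_dims, levels=256):
    """Convert a log-density of dequantized data (nats) to bits per dimension.

    Each discrete value occupies a bin of width 1 / levels after rescaling, so
    the conventional conversion subtracts num_dims * log(levels) before
    changing the base of the logarithm from e to 2.
    """
    log_prob_discrete = log_density_nats - num_dims * math.log(levels)
    return -log_prob_discrete / (num_dims * math.log(2))


rng = random.Random(0)
frame = [0, 17, 128, 255]            # a hypothetical 4-value "frame"
noisy = dequantize(frame, rng)

# Sanity check: a uniform density on [0, 1)^D has log-density 0, which
# corresponds to 8 bits per dimension for 8-bit data.
uniform_cost = bits_per_dim(0.0, num_dims=len(frame))
```

Under this convention, a model is only useful if its cost falls well below the 8 bits per dimension of the uniform baseline.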
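In flow-based generative models (Dinh et al., 2014, 2016), we model the data as if it were first generated from a latent space $p_\theta(\mathbf{z})$, then transformed to $\mathbf{x}$ through an invertible transformation:
$$\mathbf{z} \sim p_\theta(\mathbf{z}) \tag{1}$$
$$\mathbf{x} = g_\theta(\mathbf{z}) \tag{2}$$
where $\mathbf{z}$ is the latent variable and $p_\theta(\mathbf{z})$ has a simple, tractable density, such as a spherical multivariate Gaussian distribution: $p_\theta(\mathbf{z}) = \mathcal{N}(\mathbf{z}; 0, \mathbf{I})$. The function $g_\theta$ is invertible, also called bijective, such that given a datapoint $\mathbf{x}$, latent-variable inference is done by $\mathbf{z} = f_\theta(\mathbf{x}) = g_\theta^{-1}(\mathbf{x})$. We will omit the subscript $\theta$ from $f_\theta$ and $g_\theta$.
We focus on functions where $f$ (and, likewise, $g$) is composed of a sequence of invertible transformations: $f = f_K \circ \cdots \circ f_2 \circ f_1$. Under the change of variables of Eq. (2), the probability density function (PDF) of the model given a datapoint can be written as:
$$\log p_\theta(\mathbf{x}) = \log p_\theta(\mathbf{z}) + \log \left| \det(d\mathbf{z}/d\mathbf{x}) \right| \tag{3}$$
$$= \log p_\theta(\mathbf{z}) + \sum_{i=1}^{K} \log \left| \det(d\mathbf{h}_i/d\mathbf{h}_{i-1}) \right| \tag{4}$$
where $\mathbf{h}_0 = \mathbf{x}$, $\mathbf{h}_i = f_i(\mathbf{h}_{i-1})$ and $\mathbf{h}_K = \mathbf{z}$. The scalar value $|\det(d\mathbf{h}_i/d\mathbf{h}_{i-1})|$ is the absolute value of the determinant of the Jacobian matrix $d\mathbf{h}_i/d\mathbf{h}_{i-1}$, also called the Jacobian determinant. Its logarithm is the change in log-density when going from $\mathbf{h}_{i-1}$ to $\mathbf{h}_i$ under transformation $f_i$. While computation of the Jacobian determinant is expensive in the general case, its value can be surprisingly simple to compute for certain choices of transformations, as explored in prior work (Deco & Brauer, 1995; Dinh et al., 2014, 2016; Rezende & Mohamed, 2015; Kingma et al., 2016; Kingma & Dhariwal, 2018). The basic idea used in this work is to choose transformations whose Jacobian is a triangular matrix, a diagonal matrix or a permutation matrix. For permutation matrices, the absolute value of the Jacobian determinant is one. For triangular and diagonal Jacobian matrices $\mathbf{J}$, the determinant is simply the product of the diagonal terms, such that:
$$\log \left| \det(\mathbf{J}) \right| = \sum_{j} \log \left| \mathbf{J}_{jj} \right| \tag{5}$$
where $\log$ takes the elementwise logarithm, and $\mathbf{J}_{jj}$ is the $j$-th element on the diagonal of matrix $\mathbf{J}$.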
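As an illustration of a transformation with a triangular Jacobian, the following is a minimal sketch of an affine coupling layer in the style of RealNVP, written with plain Python lists. The function `toy_shift_scale` is a hypothetical stand-in for the neural network used in practice, and the input length is assumed to be even; none of these names come from the VideoFlow code.

```python
import math


def toy_shift_scale(x_a):
    """Hypothetical stand-in for the network that predicts shift and log-scale."""
    shift = [math.sin(v) for v in x_a]
    log_scale = [0.5 * math.tanh(v) for v in x_a]
    return shift, log_scale


def coupling_forward(x, shift_scale_fn):
    """Affine coupling x -> z: the first half passes through unchanged, the
    second half is scaled and shifted elementwise as a function of the first."""
    half = len(x) // 2          # assumes len(x) is even
    x_a, x_b = x[:half], x[half:]
    shift, log_scale = shift_scale_fn(x_a)
    z_b = [xb * math.exp(s) + t for xb, s, t in zip(x_b, log_scale, shift)]
    # The Jacobian is triangular with diagonal (1, ..., 1, exp(s_1), ...),
    # so its log-determinant is the sum of the log-scales.
    log_det = sum(log_scale)
    return x_a + z_b, log_det


def coupling_inverse(z, shift_scale_fn):
    """Exact inverse of coupling_forward."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    shift, log_scale = shift_scale_fn(z_a)
    x_b = [(zb - t) * math.exp(-s) for zb, s, t in zip(z_b, log_scale, shift)]
    return z_a + x_b


def standard_normal_logpdf(z):
    return sum(-0.5 * (v * v + math.log(2 * math.pi)) for v in z)


def log_prob(x):
    """log p(x) = log p(z) + log|det(dz/dx)| for a single coupling layer."""
    z, log_det = coupling_forward(x, toy_shift_scale)
    return standard_normal_logpdf(z) + log_det


x = [0.3, -1.2, 0.7, 2.0]
z, log_det = coupling_forward(x, toy_shift_scale)
x_reconstructed = coupling_inverse(z, toy_shift_scale)
```

Stacking several such layers, with permutations of the dimensions in between, gives the sum over per-layer log-determinants; each layer stays cheap to invert because the unchanged half fully determines the shift and scale.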
4 Proposed Architecture
We propose a generative flow for video, extending the recently proposed Glow (Kingma & Dhariwal, 2018) and RealNVP (Dinh et al., 2016) architectures.
In our model, we break up the latent space $\mathbf{z}$ into separate latent variables per timestep: $\mathbf{z} = \{\mathbf{z}_t\}_{t=1}^{T}$. The latent variable $\mathbf{z}_t$ at timestep $t$ is an invertible transformation of a corresponding frame of video: $\mathbf{x}_t = g_\theta(\mathbf{z}_t)$. Furthermore, as in (Dinh et al., 2016; Kingma & Dhariwal, 2018), we use a multi-scale architecture (Fig. 1): the latent variable $\mathbf{z}_t$ is composed of a stack of multiple levels, $\mathbf{z}_t = \{\mathbf{z}_t^{(l)}\}_{l=1}^{L}$, where each level $l$ encodes information about frame $\mathbf{x}_t$ at a particular scale: $\mathbf{z}_t^{(l)}$, one component per level.
4.1 Autoregressive latent dynamics model
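As in equation (1), we need to choose a form of latent prior $p_\theta(\mathbf{z})$. We use the following autoregressive factorization for the latent prior:
$$p_\theta(\mathbf{z}) = \prod_{t=1}^{T} p_\theta(\mathbf{z}_t \mid \mathbf{z}_{<t}) \tag{6}$$
where $\mathbf{z}_{<t}$ denotes the latent variables of frames prior to the $t$-th timestep: $\{\mathbf{z}_1, \ldots, \mathbf{z}_{t-1}\}$. We specify the conditional prior $p_\theta(\mathbf{z}_t \mid \mathbf{z}_{<t})$ as having the following factorization:
$$p_\theta(\mathbf{z}_t \mid \mathbf{z}_{<t}) = \prod_{l=1}^{L} p_\theta(\mathbf{z}_t^{(l)} \mid \mathbf{z}_{<t}^{(l)}, \mathbf{z}_t^{(>l)}) \tag{7}$$
where $\mathbf{z}_{<t}^{(l)}$ is the set of latent variables at previous timesteps and at the same level $l$, while $\mathbf{z}_t^{(>l)}$ is the set of latent variables at the same timestep and at higher levels. See Figure 2 for a graphical illustration of the dependencies.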
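The two-level factorization can be made concrete with a short sketch that accumulates the log-density of the latents over timesteps and levels. It assumes Gaussian conditionals and uses scalar latents per level for brevity; `toy_predict` is a hypothetical stand-in for the learned network, and the choice to condition on (rather than score) the first `num_context` frames is an assumption of this example.

```python
import math


def gaussian_logpdf(value, mean, std):
    return (-0.5 * (((value - mean) / std) ** 2 + math.log(2 * math.pi))
            - math.log(std))


def toy_predict(past_same_level, higher_levels):
    """Hypothetical predictor: mean follows the last latent at this level,
    nudged by the higher-level latents of the current timestep."""
    mean = past_same_level[-1] + 0.1 * sum(higher_levels)
    return mean, 1.0


def latent_prior_logprob(z, predict, num_context=1):
    """Sum of log p(z_t^(l) | z_<t^(l), z_t^(>l)) over timesteps and levels.

    z[t][l] is the (scalar) latent of timestep t at level l. The first
    num_context timesteps are treated as given and are not scored.
    """
    num_levels = len(z[0])
    total = 0.0
    for t in range(num_context, len(z)):
        for l in range(num_levels):
            past_same_level = [z[s][l] for s in range(t)]   # z_<t at level l
            higher_levels = z[t][l + 1:]                    # z_t at levels > l
            mean, std = predict(past_same_level, higher_levels)
            total += gaussian_logpdf(z[t][l], mean, std)
    return total


latents = [[0.0, 0.0], [1.0, 0.5]]   # two timesteps, two levels
log_p = latent_prior_logprob(latents, toy_predict, num_context=1)
```

Because every term conditions only on earlier timesteps and higher levels, sampling can proceed from the top level down at each timestep and forward in time without any iterative inference.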
4.2 Invertible neural networks
As explained in Section 3, the observed variables $\mathbf{x}$ are modeled as an invertible function of the latent variable $\mathbf{z}$. We let each individual frame in the video be modeled as a function $g_\theta$ (a normalizing flow) of the corresponding latent variable: $\mathbf{x}_t = g_\theta(\mathbf{z}_t)$; see Figure 1 for an illustration. For this flow $g_\theta$, we use the multi-scale Glow architecture introduced in (Kingma & Dhariwal, 2018), which builds upon the multi-scale flow introduced in (Dinh et al., 2016). We refer to (Dinh et al., 2016; Kingma & Dhariwal, 2018) for more details.
Note that in our architecture we have chosen to let the prior $p_\theta(\mathbf{z})$, as described in eq. (6), model temporal dependencies in the data, while constraining the flow $g_\theta$ to act on separate frames of video. We have experimented with using 3D convolutional flows, but found this to be overly expensive compared to an autoregressive prior, in terms of both number of operations and number of parameters.
A separate concern is that of temporal border effects. Due to memory limits, we found it only feasible to perform SGD with a small number of sequential frames per gradient step. In the case of 3D convolutions, this would make the temporal dimension considerably smaller during training than during synthesis; this would change the model’s input distribution between training and synthesis, which often leads to various artifacts. This temporal border effect is not present in our architecture. Using 2D convolutions in our flow, together with autoregressive priors, allows us to synthesize arbitrarily long sequences without introducing temporal border effects.
4.3 Residual Network Architecture
Here we describe the architecture of the residual network that maps $\mathbf{z}_{<t}^{(l)}$ and $\mathbf{z}_t^{(>l)}$ to the mean and log-scale $(\mu_t^{(l)}, \log \sigma_t^{(l)})$ of the conditional prior. Let $\mathbf{h}_t^{(>l)}$ be the tensor representing $\mathbf{z}_t^{(>l)}$ after the split operation between levels in the multi-scale architecture. We apply a convolution over $\mathbf{h}_t^{(>l)}$ and concatenate the result across channels to each latent from the previous timesteps at the same level independently. We transform these values into $(\mu_t^{(l)}, \log \sigma_t^{(l)})$ via a stack of residual blocks. We obtain a reduction in parameter count by sharing parameters across every 2 timesteps via 3D convolutions in our residual blocks.
Each 3D residual block consists of three layers. The first layer has a filter size of 2x3x3 with 512 output channels followed by a ReLU activation. The second layer has two convolutions combined via the Gated Activation Unit (Van Den Oord et al., 2016; van den Oord et al., 2016a). The third layer is a convolution whose number of output channels is determined by the level. This block is replicated three times in parallel, with dilation rates 1, 2 and 4, after which the results of each block, in addition to the input of the residual block, are summed.
The first two layers are initialized using a Gaussian distribution and the last layer is initialized to zeroes. In that way, the residual network behaves as an identity network at initialization, allowing stable optimization. After applying a sequence of residual blocks, we use the last temporal activation, which should capture all context. We apply a final convolution to this activation to obtain $(\Delta \mathbf{z}_t^{(l)}, \log \sigma_t^{(l)})$. We then add $\Delta \mathbf{z}_t^{(l)}$ to $\mathbf{z}_{t-1}^{(l)}$ via a temporal skip connection to output $\mu_t^{(l)}$. This way, the network learns to predict the change in latent variables for a given level.
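The effect of the zero-initialized last layer combined with the temporal skip connection can be seen in a small sketch. It assumes a Gaussian conditional prior whose final layer outputs a change in the latent and a log standard deviation; the per-element linear map below is a hypothetical stand-in for the final convolution, and `features` stands in for the output of the residual stack.

```python
import math


def conditional_prior(prev_latent, features, weights, biases):
    """Return (mean, std) of a Gaussian prior over the current latent.

    prev_latent: latents of the previous timestep at this level
    features:    output of the residual stack (one value per latent here)
    weights, biases: final-layer parameters, one (delta, log_std) pair each
    """
    mean, std = [], []
    for z_prev, h, (w_delta, w_log_std), (b_delta, b_log_std) in zip(
            prev_latent, features, weights, biases):
        delta = w_delta * h + b_delta          # predicted change in the latent
        log_std = w_log_std * h + b_log_std
        mean.append(z_prev + delta)            # temporal skip connection
        std.append(math.exp(log_std))
    return mean, std


prev_latent = [0.4, -1.3]
features = [2.0, 0.7]

# Zero-initialized final layer: the prior is centred on the previous latent
# with unit standard deviation, whatever the residual stack outputs.
zeros = [(0.0, 0.0), (0.0, 0.0)]
init_mean, init_std = conditional_prior(prev_latent, features, zeros, zeros)

# Hypothetical trained parameters: the mean now moves away from z_{t-1}.
trained_w = [(0.1, -0.5), (0.2, 0.0)]
trained_mean, trained_std = conditional_prior(prev_latent, features, trained_w, zeros)
```

Starting from a prior centred on the previous latent means that, at initialization, the model predicts that nothing changes between frames, which is a reasonable default for video.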
5 Quantitative Experiments
We evaluate the performance of VideoFlow on a toy Stochastic Movement Dataset (Babaeizadeh et al., 2017) and the BAIR robot pushing dataset (Ebert et al., 2017). We provide ablations of the key components of our model to quantify their effect. Finally, we provide quantitative comparisons to previous state-of-the-art stochastic video generation baselines. The full set of hyperparameters of the VideoFlow model is described in the supplementary material.
Dataset             Bits-per-pixel
BAIR action-free    1.87
5.1 Video modelling with the Stochastic Movement Dataset
We use VideoFlow to model the Stochastic Movement Dataset used in prior work (Babaeizadeh et al., 2017). In this dataset, the first frame of every video consists of a shape placed near the center of a 64x64x3 gray background, with its type, size and color randomly sampled. The shape then moves with constant speed in one of eight randomly chosen directions (up, down, left, right, up-left, up-right, down-left, down-right). Babaeizadeh et al. (2017) show that, conditioned on the first frame, their latent-variable stochastic model is able to generate all plausible trajectories of the shape, while a deterministic model just averages out all eight possible directions in pixel space.
Since the shape moves with a uniform speed, we should be able to model the position of the shape at step $t+1$ using only the position of the shape at step $t$. More specifically, given the frame at $t = 1$, i.e., when the shape is at the center, the model should learn a distribution over 8 positions to generate the frame at $t = 2$. Given a frame at any other $t$, the model should learn a deterministic position of the shape for $t + 1$. Using this insight, we extract random temporal patches of 2 frames from each video of 3 frames. We then use the VideoFlow model to maximize the log-likelihood of the second frame given the first, i.e., the model looks back at just one frame. We observe that the bits-per-pixel on the holdout set reduces to very low values across multiple hyperparameter runs. We then generate videos conditioned on the first frame with the shape at the center. On inspection of these videos, we observe that the model consistently predicts the future trajectory of the shape to be one of the eight random directions.
5.2 Video Modeling with the BAIR Dataset
We use the action-free version of the BAIR robot pushing dataset (Ebert et al., 2017), which contains videos of a Sawyer robotic arm at a resolution of 64x64. In the absence of actions, the task of video generation is completely unsupervised, with multiple plausible trajectories due to the partial observability of the environment and the stochasticity of the robot actions.
For each video we extract the first 13 frames and take a random temporal patch of 4 frames due to memory constraints. Using Equation 6, we then train our VideoFlow model to maximize the log-likelihood of the 4th frame given the context of the 3 previous frames; that is, the residual network in Section 4.3 looks back 3 frames. This stochastic objective gives an unbiased estimator of the log-likelihood of frames 4 to 13, conditioned on the first three frames. We constrained the range to the first 13 frames in order to be comparable with the results of previous models on this dataset (Babaeizadeh et al., 2017; Lee et al., 2018). We set apart 512 videos from the training set as a validation set on which hyperparameters are optimized.
For evaluation, we use the first 3 frames as ground-truth conditioning frames. For each of the remaining 10 target frames, we compute the bits-per-pixel given the window of 3 previous frames. We then average this across all 10 target frames and the test set.
5.3 Ablation Studies
Through an ablation study, we experimentally evaluate the importance of the following components of our VideoFlow model: (1) the use of temporal skip connections, (2) the use of the Gated Activation Unit (GATU) instead of ReLUs in the residual network (Section 4.3), and (3) the use of dilations in the residual network.
We start with a VideoFlow model with 256 channels in the coupling layer and 16 steps of flow, and remove the components mentioned above to create our baseline. We use four different combinations of our components (described in Fig. 4) and keep the rest of the hyperparameters fixed across those combinations. For each combination we plot the mean bits-per-pixel on the holdout BAIR action-free dataset over 300K training steps for both affine and additive coupling in Figure 4. For both coupling layers, we observe that the VideoFlow model with all the components provides a significant boost in bits-per-pixel over our baseline.
We also note that other combinations, namely dilated convolutions + GATU (C) and dilated convolutions + the temporal skip connection, improve over the baseline. Finally, we found that increasing the receptive field of the residual network using dilated convolutions alone, in the absence of the temporal skip connection or the GATU, makes training highly unstable.
5.4 Comparison with stochastic videogeneration baselines
We compare against two state-of-the-art stochastic video generation models, SAVP-VAE (Lee et al., 2018) and SV2P (Babaeizadeh et al., 2017). We use the implementation of these models in the open-source Tensor2Tensor (Vaswani et al., 2018) library. We train these baseline video models to predict ten frames given three conditioning frames, ensuring that all the video models have seen a total of 13 frames during training.
Both these models use variations of temporal VAEs which optimize a lower bound on the log-likelihood and hence are not directly comparable to our model. To make a quantitative comparison with the baselines, we follow the metrics proposed in prior work (Babaeizadeh et al., 2017; Lee et al., 2018). For a given set of conditioning frames in the BAIR action-free test set, we generate 100 videos from each of the stochastic models. We then compute the closest of these generated videos to the ground truth according to three different metrics: PSNR (Peak Signal to Noise Ratio), SSIM (Structural Similarity) (Wang et al., 2004) and cosine similarity using features obtained from a pretrained VGG network (Dosovitskiy & Brox, 2016; Johnson et al., 2016). Our baselines are also tuned using this VGG-based cosine similarity metric, on a search grid available in the appendix. We report our findings in Figure 5. This metric helps us understand if the true future lies in the set of all plausible futures according to the video model (and the implicit embedding space of each of the metrics).
In prior work, Lee et al. (2018) and Babaeizadeh et al. (2017) do not train a stochastic decoder to learn the variance in pixel space; rather, they use a deterministic decoder and effectively treat this variance as a hyperparameter. They search for the variance on a grid of extremely small values on a log scale using a two-stage training procedure. They show that this greatly improves training stability and removes pixel-level noise during generation.
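The "closest of 100 samples" evaluation can be sketched as follows for the PSNR metric. This is an illustrative sketch rather than the evaluation code used for Figure 5: videos are flattened to lists of pixel values in [0, 1], and the helper names `psnr` and `best_of_n` are hypothetical. SSIM or VGG cosine similarity would be plugged in as a different `metric` function.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized sequences."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def best_of_n(samples, target, metric):
    """Score every sample against the ground truth and keep the best one.

    Assumes that a higher metric value means a closer match.
    """
    scores = [metric(sample, target) for sample in samples]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


ground_truth = [0.0, 0.5, 1.0]
samples = [
    [0.2, 0.5, 1.0],
    [0.0, 0.5, 0.9],
    [0.5, 0.5, 0.5],
]
best_index, best_score = best_of_n(samples, ground_truth, psnr)
```

Taking the maximum over samples rewards a model for covering the true future with at least one sample, so it measures coverage rather than the typical quality of a sample.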
We can remove pixel-level noise in our VideoFlow model, resulting in higher quality videos at the cost of diversity, by sampling videos at a lower temperature, analogous to the low-temperature procedure in (Kingma & Dhariwal, 2018). For a network trained with additive coupling layers, we can sample the frame $\mathbf{x}_t$ from $p(\mathbf{x}_t \mid \mathbf{x}_{<t})$ with a temperature $T$ simply by scaling the standard deviation of the latent Gaussian distribution $p(\mathbf{z}_t \mid \mathbf{z}_{<t})$ by a factor of $T$. To achieve a balance between quality and diversity, we tune the temperature using the maximum VGG similarity across 100 video samples with the ground truth as a metric (the temperature was tuned on a linear scale between 0.1 and 1.0 on the validation set). We report results with a temperature of 1.0 and the optimal temperature in Figure 5.
Our model with optimal temperature performs as well as the SAVP-VAE model on the VGG-based similarity metrics, which correlate well with human perception (Zhang et al., 2018). Our model with temperature outperforms the SV2P model. PSNR and SSIM are explicitly pixel-level metrics, which SAVP-VAE incorporates as part of its optimization objective. VideoFlow, on the other hand, models the conditional probability of the distribution of frames, hence as expected it underperforms SAVP-VAE on PSNR and SSIM.
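Temperature-scaled sampling from the latent prior amounts to one extra multiplication, as the sketch below shows. The function name `sample_latent` and the example means and standard deviations are hypothetical; in the full model the sampled latent would then be passed through the inverse flow to obtain a frame.

```python
import random


def sample_latent(mean, std, temperature, rng):
    """Sample from a diagonal Gaussian whose standard deviation is scaled by T."""
    return [m + temperature * s * rng.gauss(0.0, 1.0)
            for m, s in zip(mean, std)]


mean = [0.3, -0.2]
std = [1.0, 2.0]

# Using the same seed isolates the effect of the temperature: the deviation
# from the mean shrinks in proportion to T.
full = sample_latent(mean, std, 1.0, random.Random(1))
half = sample_latent(mean, std, 0.5, random.Random(1))
mode = sample_latent(mean, std, 0.0, random.Random(1))   # collapses to the mean
```

At $T = 0$ every sample equals the predicted mean, which removes diversity entirely; intermediate temperatures trade noise against diversity.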
We also computed the variational bound of the bits-per-pixel loss, via importance sampling, from the posteriors for the SAVP-VAE and SV2P models. Neither of these models estimates a pixel-level variance, which is required for estimating the loss; we estimated the optimal pixel-level variance for both models. We obtain high values of bits-per-pixel, larger than 6, for these models. We attribute this to the optimization objective of these models: they do not optimize the log-likelihood directly due to the presence of a term in their objective.
5.5 Generation time
For our model used to demonstrate qualitative results using additive coupling layers, sampling 20 frames of 64x64 resolution takes less than 3.5 seconds on an NVIDIA P100 GPU. To our knowledge, the fastest autoregressive model for video (Reed et al., 2017) that models log-likelihood directly generates a frame every 3 seconds. An important caveat is that code and hardware differences make these numbers not directly comparable.
5.6 Outofsequence detection
We use our trained VideoFlow model, conditioned on 3 frames as explained in Section 5.2, to detect the plausibility of a temporally inconsistent frame occurring in the immediate future. To do this, we condition the model on the first three frames $\mathbf{x}_{<4}$ of a test-set video to obtain a distribution $p(\mathbf{x}_4 \mid \mathbf{x}_{<4})$ over its 4th frame $\mathbf{x}_4$. We then compute the likelihood of the $t$-th frame $\mathbf{x}_t$ of the same video occurring at the 4th timestep under this distribution, i.e., $p(\mathbf{x}_4 = \mathbf{x}_t \mid \mathbf{x}_{<4})$ for $t \geq 4$. We average the corresponding bits-per-pixel values across the test set and report our findings in Figure 6. We find that our model assigns a monotonically decreasing log-likelihood to frames that are further out in the future and hence less likely to occur at the 4th timestep.
Secondly, for the distribution obtained from each test-set video as explained above, we randomly sample another video from the test set and choose its 4th frame. We then compute the mean bits-per-pixel assigned to this frame across the test set. We repeat this experiment 1000 times and observe the mean across the 1000 trials to be 8.876 with a standard error of 0.002. Our results reflect the intuition that frames from a different video should be less likely to occur at the 4th timestep than frames from the same video at a different timestep.
6 Qualitative Experiments
We demonstrate qualitative results by generating videos conditioned on input frames and interpolations in latent space for both datasets. The qualitative results can be viewed at https://sites.google.com/corp/view/videoflow/home. In the generated videos, a blue border marks the conditioning frames, while a red border marks the generated frames.
6.1 Effect of temperature
We study the effect of temperature on the quality of generated videos in Figure 7. For each temperature, we sample 100 videos from the model. We then compute the max cosine similarity across these 100 videos based on features obtained from a pretrained VGG network with the ground truth as described in Section 5.4. We display the worst and best videos according to this metric. On inspection, we observe that even our “worst” videos across temperatures according to this metric are temporally cohesive and the robot arm looks sharp and realistic. We believe that though these videos are of high quality and are physically plausible, they are far from the ground truth, which itself represents just one plausible future in VGG feature space.
At lower temperatures, the arm exhibits slow motion with the background objects remaining static and clear, while at higher temperatures, the arm moves much more rapidly, with the background objects becoming much noisier. In our qualitative experiments, a temperature of 0.5 gives a good trade-off between these two properties.
6.2 Longer predictions
We generate 100 frames into the future using our model trained on 13 frames with a temperature of 0.5. We display our results in Figure 8. On the top, even 100 frames into the future, the generated frames remain in the image manifold maintaining temporal consistency.
We additionally display a failure mode on the bottom. In the presence of occlusions, the arm remains super-sharp but the background objects become noisier and blurrier. We hypothesize that this is due to the following reason. Our VideoFlow model has a bijection between the latent state $\mathbf{z}_t$ and the frame $\mathbf{x}_t$, meaning that the latent state $\mathbf{z}_t$ cannot store information other than that present in the frame $\mathbf{x}_t$. This, in combination with the Markovian assumption in our latent dynamics, means that the model can forget objects if they have been occluded for a few frames. In future work, we would address this drawback by incorporating longer memory in our VideoFlow model, for example by parameterizing the network in our autoregressive prior (eq. 9) as a recurrent neural network instead of a residual network. Training on larger temporal patches could also potentially be made feasible by using more memory-efficient backpropagation algorithms for invertible neural networks, as initially explored by Gomez et al. (2017).
6.3 Likelihood vs Quality
We show the correlation between training progression (measured in bits-per-pixel) and the quality of the generated videos in Figure 9. We display the videos generated by conditioning on frames from the test set for three different values of bits-per-pixel on the test set. As we approach lower bits-per-pixel, our VideoFlow model learns to model the structure of the arm as well as its motion with high quality, resulting in high-quality video.
6.4 Latent space interpolation
BAIR robot pushing dataset: We encode the first input frame and the last target frame into the latent space using our trained VideoFlow encoder and perform interpolations. We find that the motion of the arm is interpolated in a temporally cohesive fashion between the initial and final position. Further, we use the multi-level latent representation to interpolate representations at a particular level while keeping the representations at other levels fixed. We find that the bottom level interpolates the motion of background objects, which are at a smaller scale, while the top level interpolates the arm motion.
Stochastic Movement Dataset: We encode two different shapes with their type fixed but a different size and color into the latent space. We observe that the size of the shape gets smoothly interpolated. During training, we sample the colors of the shapes from a uniform discrete distribution which is reflected in our experiments. We observe that all the colors in the interpolated space lie in the set of colors in the training set.
7 Code for reproducing results
Our code to reproduce the experimental results is available in the public Tensor2Tensor repository.
8 Conclusion and Discussion
We describe a practically applicable architecture for flow-based video prediction models, inspired by the Glow model for image generation (Kingma & Dhariwal, 2018), which we call VideoFlow. We introduce a latent dynamical system model that predicts future values of the flow model’s latent state, replacing the standard unconditional prior distribution. Our empirical results show that VideoFlow achieves results that are competitive with the state-of-the-art VAE models in stochastic video prediction. Finally, our model optimizes the log-likelihood directly, which makes it easy to evaluate, and it achieves faster synthesis than pixel-level autoregressive video models, making it suitable for practical purposes. In future work, we plan to incorporate memory in VideoFlow to model arbitrary long-range dependencies and to apply the model to challenging downstream tasks.
References
 Babaeizadeh et al. (2017) Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., and Levine, S. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
 Boots et al. (2014) Boots, B., Byravan, A., and Fox, D. Learning predictive models of a depth camera & manipulator from raw execution traces. In International Conference on Robotics and Automation (ICRA), 2014.
 De Brabandere et al. (2016) De Brabandere, B., Jia, X., Tuytelaars, T., and Van Gool, L. Dynamic filter networks. In Neural Information Processing Systems (NIPS), 2016.
 Deco & Brauer (1995) Deco, G. and Brauer, W. Higher order statistical decorrelation without information loss. Advances in Neural Information Processing Systems, pp. 247–254, 1995.
 Denton & Fergus (2018) Denton, E. and Fergus, R. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
 Dinh et al. (2014) Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
 Dosovitskiy & Brox (2016) Dosovitskiy, A. and Brox, T. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666, 2016.
 Ebert et al. (2017) Ebert, F., Finn, C., Lee, A. X., and Levine, S. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268, 2017.
 Finn & Levine (2017) Finn, C. and Levine, S. Deep visual foresight for planning robot motion. In International Conference on Robotics and Automation (ICRA), 2017.
 Finn et al. (2016) Finn, C., Goodfellow, I., and Levine, S. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, 2016.
 Gomez et al. (2017) Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2211–2221, 2017.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Graves (2013) Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

 Johnson et al. (2016) Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711. Springer, 2016.
 Kalchbrenner et al. (2017) Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., and Kavukcuoglu, K. Video pixel networks. International Conference on Machine Learning (ICML), 2017.
 Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236–10245, 2018.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations, 2013.
 Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106–1114, 2012.
 Lee et al. (2018) Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., and Levine, S. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
 Liu et al. (2017) Liu, Z., Yeh, R., Tang, X., Liu, Y., and Agarwala, A. Video frame synthesis using deep voxel flow. International Conference on Computer Vision (ICCV), 2017.
 Lotter et al. (2017) Lotter, W., Kreiman, G., and Cox, D. Deep predictive coding networks for video prediction and unsupervised learning. International Conference on Learning Representations (ICLR), 2017.
 Mathieu et al. (2016) Mathieu, M., Couprie, C., and LeCun, Y. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR), 2016.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Prenger et al. (2018) Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. CoRR, abs/1811.00002, 2018. URL http://arxiv.org/abs/1811.00002.
 Ramachandran et al. (2017) Ramachandran, P., Paine, T. L., Khorrami, P., Babaeizadeh, M., Chang, S., Zhang, Y., Hasegawa-Johnson, M. A., Campbell, R. H., and Huang, T. S. Fast generation for convolutional autoregressive models. arXiv preprint arXiv:1704.06001, 2017.
 Ranzato et al. (2014) Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., and Chopra, S. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
 Reed et al. (2017) Reed, S., Oord, A. v. d., Kalchbrenner, N., Colmenarejo, S. G., Wang, Z., Belov, D., and de Freitas, N. Parallel multiscale autoregressive density estimation. arXiv preprint arXiv:1703.03664, 2017.
 Rezende & Mohamed (2015) Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1530–1538, 2015.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1278–1286, 2014.
 Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
 Srivastava et al. (2015) Srivastava, N., Mansimov, E., and Salakhutdinov, R. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 2015.
 Van Den Oord et al. (2016) Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 van den Oord et al. (2016a) van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016a.
 van den Oord et al. (2016b) van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016b.
 van den Oord et al. (2016c) van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328, 2016c.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
 Vaswani et al. (2018) Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A. N., Gouws, S., Jones, L., Kaiser, Ł., Kalchbrenner, N., Parmar, N., et al. Tensor2Tensor for neural machine translation. arXiv preprint arXiv:1803.07416, 2018.

 Vondrick & Torralba (2017) Vondrick, C. and Torralba, A. Generating the future with adversarial transformers. In Computer Vision and Pattern Recognition (CVPR), 2017.
 Vondrick et al. (2015) Vondrick, C., Pirsiavash, H., and Torralba, A. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.
 Walker et al. (2015) Walker, J., Gupta, A., and Hebert, M. Dense optical flow prediction from a static image. In International Conference on Computer Vision (ICCV), 2015.
 Wang et al. (2004) Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 2004.
 Xingjian et al. (2015) Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., and Woo, W.-c. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, 2015.

 Zhang et al. (2018) Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint, 2018.
Appendix A VideoFlow: BAIR Hyperparameters
A.1 Quantitative: Bits-per-pixel
To report bits-per-pixel, we use the following set of hyperparameters. We use a learning rate schedule with linear warmup for the first 10000 steps and linear decay for the last 150000 steps.
Hyperparameter  Value 

Flow levels  3 
Flow steps per level  24 
Coupling  Affine 
Number of coupling layer channels  512 
Optimizer  Adam 
Batch size  40 
Learning rate  3e-4 
Number of 3D residual blocks  5 
Number of 3D residual channels  256 
Training steps  600K 
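The learning rate schedule for this configuration can be sketched as a function of the training step. The text specifies only the warmup and decay durations, so several details here are assumptions: the peak learning rate is read as 3e-4, the warmup starts from zero, the rate is held constant between warmup and decay, and the decay ends at zero. The function name `learning_rate` and its parameters are hypothetical.

```python
def learning_rate(
    step: int,
    base_lr: float = 3e-4,        # assumed peak learning rate
    warmup_steps: int = 10_000,   # linear warmup over the first 10000 steps
    decay_steps: int = 150_000,   # linear decay over the last 150000 steps
    total_steps: int = 600_000,   # 600K training steps
    final_lr: float = 0.0,        # assumption: decay ends at zero
) -> float:
    """Piecewise-linear schedule: warmup, constant plateau, linear decay."""
    if step < warmup_steps:
        # Assumption: warmup ramps linearly from zero to base_lr.
        return base_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step <= decay_start:
        return base_lr
    # Fraction of the decay phase completed, clipped at 1.
    frac = min(1.0, (step - decay_start) / decay_steps)
    return base_lr + frac * (final_lr - base_lr)
```

Under these assumptions the rate reaches its peak at step 10000, stays there until step 450000, and falls linearly to zero at step 600000.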
A.2 Qualitative Experiments
For all qualitative experiments and quantitative comparisons with the baselines, we used the following sets of hyperparameters.
Hyperparameter  Value 

Flow levels  3 
Flow steps per level  24 
Coupling  Additive 
Number of coupling layer channels  392 
Optimizer  Adam 
Batch size  40 
Learning rate  3e-4 
Number of 3D residual blocks  5 
Number of 3D residual channels  256 
Training steps  500K 
Appendix B Hyperparameter grid for the baseline video models.
We train all our baseline models for 300K steps using the Adam optimizer. Our models were tuned using the maximum VGG cosine similarity with the ground truth across 100 decodes.
SAVP-VAE and SV2P: We use three values of the latent loss multiplier: 1e-3, 1e-4 and 1e-5. For the SAVP-VAE model, we additionally apply linear decay on the learning rate for the last 100K steps.
SAVP-GAN: We tune the GAN loss multiplier and the learning rate on a log scale, from 1e-2 to 1e-4 and from 1e-3 to 1e-5 respectively.
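The selection criterion used for tuning, the best cosine similarity to the ground truth over a set of decodes, can be sketched as below. This is an illustration under stated assumptions: plain feature vectors stand in for activations of a pretrained VGG network, each video is summarized by a single vector, and the per-video scores are averaged. The function names are hypothetical and the aggregation over frames is not modeled.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_of_n_score(
    truth_features: List[Sequence[float]],
    decode_features: List[List[Sequence[float]]],
) -> float:
    """Average over videos of the maximum similarity across that video's decodes.

    truth_features[i] is the ground-truth feature vector of video i and
    decode_features[i] is the list of feature vectors of its sampled decodes.
    """
    per_video = [
        max(cosine_similarity(truth, decode) for decode in decodes)
        for truth, decodes in zip(truth_features, decode_features)
    ]
    return sum(per_video) / len(per_video)
```

A hyperparameter setting would then be chosen by computing this score for each candidate and keeping the highest.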
Appendix C Correlation between VGG perceptual similarity and bits-per-pixel
We plot the correlation between cosine similarity computed with a pretrained VGG network and bits-per-pixel computed with our trained VideoFlow model. We compare as done in Section 5.6 and the VGG cosine similarity between and for . We report our results for every video in the test set in Figure 6. We notice a weak correlation between VGG perceptual metrics and bits-per-pixel with a correlation factor of .
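A correlation of this kind between two per-video metrics can be computed as sketched below. The text does not state which correlation coefficient is used; this sketch assumes the Pearson coefficient, and the function name `pearson_correlation` is hypothetical. The inputs would be one VGG cosine similarity and one bits-per-pixel value per test video.

```python
import math
from typing import Sequence


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Values near zero indicate a weak linear relationship, while values near 1 or -1 indicate a strong one.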