
HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator

09/15/2022
by   Younggyo Seo, et al.

Video prediction is an important yet challenging problem, burdened with the tasks of generating future frames and learning environment dynamics. Recently, autoregressive latent video models have proved to be a powerful video prediction tool, by separating the video prediction into two sub-problems: pre-training an image generator model, followed by learning an autoregressive prediction model in the latent space of the image generator. However, successfully generating high-fidelity and high-resolution videos has yet to be seen. In this work, we investigate how to train an autoregressive latent video prediction model capable of predicting high-fidelity future frames with minimal modification to existing models, and produce high-resolution (256x256) videos. Specifically, we scale up prior models by employing a high-fidelity image generator (VQ-GAN) with a causal transformer model, and introduce additional techniques of top-k sampling and data augmentation to further improve video prediction quality. Despite its simplicity, the proposed method achieves competitive performance to state-of-the-art approaches on standard video prediction benchmarks with fewer parameters, and enables high-resolution video prediction on complex and large-scale datasets. Videos are available at https://sites.google.com/view/harp-videos/home.


1 Introduction

Figure 1: Selected video sample generated by HARP on RoboNet (Dasari et al., 2019).

Video prediction can enable agents to learn useful representations for predicting the future consequences of their decisions, which is crucial for solving tasks that require long-term planning, including robotic manipulation (Finn and Levine, 2017; Kalashnikov et al., 2018) and autonomous driving (Xu et al., 2017). Despite recent advances in improving the quality of video prediction (Finn et al., 2016; Babaeizadeh et al., 2018; Denton and Fergus, 2018; Lee et al., 2018; Weissenborn et al., 2020; Babaeizadeh et al., 2021), learning an accurate video prediction model remains a notoriously difficult problem that requires substantial computing resources, especially when the inputs are high-resolution video sequences (Villegas et al., 2019; Clark et al., 2019; Luc et al., 2020). This is because a video prediction model must excel at both generating high-fidelity images and learning the dynamics of environments, even though each task on its own is already very challenging.

Recently, autoregressive latent video prediction methods (Rakhimov et al., 2021; Yan et al., 2021, 2022) have been proposed to improve the efficiency of video prediction by separating it into two sub-problems: first pre-training an image generator (e.g., VQ-VAE; Oord et al. 2017), and then learning an autoregressive prediction model (Weissenborn et al., 2020; Chen et al., 2020) in the latent space of the pre-trained image generator. However, prior works are limited in that they only consider relatively low-resolution videos when demonstrating the efficiency of the approach; it is questionable whether such experiments can fully demonstrate the benefit of operating in the latent space of an image generator instead of the pixel-channel space.

In this paper, we present High-fidelity AutoRegressive latent video Prediction (HARP), which scales up previous autoregressive latent video prediction methods for high-fidelity video prediction. The main design principle of HARP is simplicity: we improve video prediction quality with minimal modification to existing methods. First, for image generation, we employ a high-fidelity image generator, i.e., a vector-quantized generative adversarial network (VQ-GAN; Esser et al. 2021). This improves video prediction by enabling high-fidelity image generation (up to 256x256 pixels) on various video datasets. Then a causal transformer model (Chen et al., 2020), which operates on top of discrete latent codes, is trained to predict the discrete codes from VQ-GAN, and the autoregressive predictions made by the transformer are decoded into future frames at inference time.

We highlight the main contributions of this paper below:

  • We show that our autoregressive latent video prediction model, HARP, can predict high-resolution (256x256 pixels) frames on a robotics dataset (i.e., Meta-World (Yu et al., 2020)) and a large-scale real-world robotics dataset (i.e., RoboNet (Dasari et al., 2019)).

  • We show that HARP can leverage an image generator pre-trained on ImageNet for training a high-resolution video prediction model on the complex, large-scale Kinetics-600 dataset (Carreira et al., 2018).

  • HARP achieves competitive or superior performance to prior state-of-the-art video prediction models on widely-used BAIR Robot Pushing (Ebert et al., 2017) and KITTI driving (Geiger et al., 2013) video prediction benchmarks.

2 Related work

Video prediction.

Video prediction aims to predict future frames conditioned on images (Michalski et al., 2014; Ranzato et al., 2014; Srivastava et al., 2015; Vondrick et al., 2016; Lotter et al., 2017), texts (Wu et al., 2021b), or actions (Oh et al., 2015; Finn et al., 2016), which is useful for several applications, e.g., model-based RL (Hafner et al., 2019; Kaiser et al., 2020; Hafner et al., 2021; Rybkin et al., 2021; Seo et al., 2022a, b) and simulator development (Kim et al., 2020, 2021). Various video prediction models have been proposed with different approaches, including generative adversarial networks (GANs; Goodfellow et al. 2014), known to generate high-fidelity images, that introduce adversarial discriminators which also consider temporal or motion information (Aigner and Körner, 2018; Jang et al., 2018; Kwon and Park, 2019; Clark et al., 2019; Luc et al., 2020; Skorokhodov et al., 2022; Yu et al., 2022), latent video prediction models that operate in a latent space (Babaeizadeh et al., 2018; Denton and Fergus, 2018; Lee et al., 2018; Villegas et al., 2019; Wu et al., 2021a; Babaeizadeh et al., 2021), and autoregressive video prediction models that operate in pixel space by predicting the next pixels in an autoregressive manner (Kalchbrenner et al., 2017; Reed et al., 2017; Weissenborn et al., 2020).

Autoregressive latent video prediction.

Most closely related to our work are autoregressive latent video prediction models that separate the video prediction problem into image generation and dynamics learning. Walker et al. (2021) proposed to learn a hierarchical VQ-VAE (Razavi et al., 2019) that extracts multi-scale hierarchical latents, and then to train SNAIL blocks (Chen et al., 2018) that predict the hierarchical latent codes, enabling high-fidelity video prediction. However, this involves a complicated training pipeline and a video-specific architecture, which limits its applicability. As simple alternatives, Rakhimov et al. (2021) and Yan et al. (2021, 2022) proposed to first learn a VQ-VAE (Oord et al., 2017) and then train a causal transformer with 3D self-attention (Weissenborn et al., 2020) and factorized 2D self-attention (Child et al., 2019), respectively. These approaches, however, are limited in that they only consider low-resolution videos. We instead present a simple high-resolution video prediction method that incorporates the strengths of both prior approaches.

Figure 2: Illustration of our approach. We first train a VQ-GAN model that encodes frames into discrete latent codes. Then the discrete codes are flattened following the raster scan order, and a causal transformer model is trained to predict the next discrete codes in an autoregressive manner.

3 Preliminaries

We aim to learn a video prediction model that predicts future frames conditioned on the first $c$ frames of a video $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_T)$, where $\mathbf{x}_t$ is the frame at timestep $t$. Optionally, one can also consider conditioning the prediction model on actions that the agents would take.

3.1 Autoregressive video prediction model

An autoregressive video prediction model (Weissenborn et al., 2020) approximates the distribution of a video in pixel-channel space. Given a video $\mathbf{x}$, the joint distribution over pixels conditioned on the first $c$ frames is modelled as the product over channel intensities of all pixels except those of the conditioning frames:

$$p(\mathbf{x}_{c+1:T} \mid \mathbf{x}_{1:c}) = \prod_{i} \prod_{k} p\big(x_{\pi(i)}^{k} \mid x_{\pi(<i)}, x_{\pi(i)}^{<k}\big), \tag{1}$$

where the product over $i$ runs over all pixels not in the conditioning frames, $\pi$ is a raster-scan ordering over all pixels of the video (we refer to Weissenborn et al. (2020) for more details), $x_{\pi(<i)}$ denotes all pixels before $x_{\pi(i)}$, $x_{\pi(i)}^{k}$ is the $k$-th channel intensity of the pixel $x_{\pi(i)}$, and $x_{\pi(i)}^{<k}$ denotes all channel intensities before $x_{\pi(i)}^{k}$.
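The factorization in (1) can be made concrete with a short sketch. Below is a minimal, illustrative PyTorch computation of the chain-rule log-likelihood over raster-scan-ordered pixels; it is not the authors' code, and `model` is a hypothetical stand-in that maps the previously seen pixels and channels to logits over 256 intensity levels.

```python
import torch

def video_log_likelihood(model, video, num_cond_frames):
    # video: (T, H, W, C) tensor of integer intensities in [0, 255]
    T, H, W, C = video.shape
    flat = video.reshape(T * H * W, C)       # raster-scan ordering pi over pixels
    start = num_cond_frames * H * W          # pixels of conditioning frames are not scored
    log_prob = torch.tensor(0.0)
    for i in range(start, T * H * W):
        for k in range(C):
            # context: all pixels before pi(i), plus channels < k of pixel pi(i)
            logits = model(flat[:i], flat[i, :k])                  # (256,) intensity logits
            log_prob = log_prob + torch.log_softmax(logits, dim=-1)[flat[i, k]]
    return log_prob

# Toy stand-in model that ignores its context and returns uniform logits.
uniform_model = lambda prev_pixels, prev_channels: torch.zeros(256)
clip = torch.randint(0, 256, (2, 4, 4, 3))
print(video_log_likelihood(uniform_model, clip, num_cond_frames=1))
```

The nested loop makes explicit why pixel-space autoregression is expensive: the number of prediction steps grows with the full pixel-channel count of the video.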

3.2 Vector quantized variational autoencoder

VQ-VAE (Oord et al., 2017) consists of an encoder that compresses images into discrete representations and a decoder that reconstructs images from these discrete representations. Formally, given an image $x$, the encoder $E$ encodes $x$ into a feature map consisting of a series of latent vectors $\{z^i\} = E(x)$, where $i$ follows a raster-scan ordering over the feature map of size $h \times w$. Then each $z^i$ is quantized to a discrete representation $z_q^i$ based on the distance of the latent vectors to the prototype vectors $\{e_j\}_{j=1}^{K}$ in a codebook:

$$z_q^i = e_{j^*}, \quad j^* = \operatorname*{arg\,min}_{j \in [K]} \lVert z^i - e_j \rVert_2, \tag{2}$$

where $[K]$ is the set $\{1, \ldots, K\}$. Then the decoder $G$ learns to reconstruct $x$ from the discrete representations $z_q = \{z_q^i\}$. The VQ-VAE is trained by minimizing the following objective:

$$\mathcal{L}_{\text{VQ}} = \lVert x - \hat{x} \rVert_2^2 + \lVert \text{sg}[E(x)] - z_q \rVert_2^2 + \beta \lVert \text{sg}[z_q] - E(x) \rVert_2^2, \tag{3}$$

where $\text{sg}[\cdot]$ refers to the stop-gradient operator, the first term is a reconstruction loss for learning representations useful for reconstructing images, the second term is a codebook loss that brings codebook representations closer to the corresponding encoder outputs, and the third term is a commitment loss weighted by $\beta$ that prevents encoder outputs from fluctuating frequently between different representations.
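For concreteness, the quantization of (2) and the three loss terms of (3) can be written in a few lines of PyTorch. This is an illustrative sketch under our own naming, not the released VQ-VAE code.

```python
import torch
import torch.nn.functional as F

def quantize(z, codebook):
    # z: (B, h, w, D) encoder outputs, codebook: (K, D) prototype vectors
    dist = torch.cdist(z.flatten(0, 2), codebook)   # (B*h*w, K) pairwise L2 distances
    idx = dist.argmin(dim=-1)                       # nearest prototype index per latent
    z_q = codebook[idx].view_as(z)                  # quantized latents
    return z_q, idx.view(z.shape[:-1])

def vqvae_loss(x, x_hat, z, z_q, beta=0.25):
    rec = F.mse_loss(x_hat, x)                      # reconstruction term
    codebook_loss = F.mse_loss(z_q, z.detach())     # ||sg[E(x)] - z_q||^2
    commit = F.mse_loss(z, z_q.detach())            # ||sg[z_q] - E(x)||^2
    return rec + codebook_loss + beta * commit

# Straight-through estimator commonly used so gradients reach the encoder:
# z_q_st = z + (z_q - z).detach()
```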

3.3 Vector quantized generative adversarial network

VQ-GAN (Esser et al., 2021) is a variant of VQ-VAE that (a) replaces the $\ell_2$ reconstruction loss in (3) by a perceptual loss $\mathcal{L}_{\text{rec}}$ (Zhang et al., 2018), and (b) introduces an adversarial training scheme in which a patch-level discriminator $D$ (Isola et al., 2017) is trained to discriminate between real and generated images by maximizing the following loss:

$$\mathcal{L}_{\text{GAN}} = \log D(x) + \log\big(1 - D(\hat{x})\big). \tag{4}$$

The overall objective is then given by:

$$\mathcal{L} = \mathcal{L}_{\text{VQ}} + \lambda \, \mathcal{L}_{\text{GAN}}, \quad \lambda = \frac{\nabla_{G_L}[\mathcal{L}_{\text{rec}}]}{\nabla_{G_L}[\mathcal{L}_{\text{GAN}}] + \delta}, \tag{5}$$

where $\lambda$ is an adaptive weight, $\nabla_{G_L}[\cdot]$ is the gradient with respect to the inputs of the last layer $L$ of the decoder $G$, and $\delta$ is a scalar introduced for numerical stability.
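The adaptive weight in (5) can be computed from the two gradient norms at the decoder's last layer. The sketch below is our own illustration (the clamp is a common implementation detail, not stated above).

```python
import torch

def adaptive_weight(rec_loss, gan_loss, last_layer_weight, delta=1e-6):
    # Gradient of each loss w.r.t. the last decoder layer's parameters.
    grad_rec = torch.autograd.grad(rec_loss, last_layer_weight, retain_graph=True)[0]
    grad_gan = torch.autograd.grad(gan_loss, last_layer_weight, retain_graph=True)[0]
    lam = grad_rec.norm() / (grad_gan.norm() + delta)
    return torch.clamp(lam, 0.0, 1e4).detach()      # treat lambda as a constant

# total_loss = vq_loss + adaptive_weight(rec_loss, gan_loss, decoder_last_weight) * gan_loss
```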

4 Method

We present HARP, a video prediction model capable of predicting high-fidelity future frames. Our method is designed to fully exploit the benefit of an autoregressive latent video prediction model that separates video prediction into image generation and dynamics learning. The full architecture of HARP is illustrated in Figure 2.

4.1 High-fidelity image generator

We utilize the VQ-GAN model (Esser et al., 2021), which has proven to be effective for high-resolution image generation, as our image generator (see Section 3 for the formulation of VQ-GAN). Specifically, we first pre-train the image generator and then freeze the model throughout training to improve the efficiency of learning video prediction models. The notable difference to a prior work that utilizes 3D convolutions to temporally downsample the video for efficiency (Yan et al., 2021) is that our image generator operates on single images; hence the image generator focuses solely on improving the quality of generated images. Importantly, this enables us to utilize a VQ-GAN model pre-trained on a wide range of natural images, e.g., ImageNet, without training the image generator on the target datasets, which can significantly reduce the training cost of a high-resolution video prediction model.
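As a rough sketch (our own code; `load_pretrained_vqgan` and `encode_indices` are hypothetical names standing in for a pre-trained generator's loader and encoder interface), freezing the pre-trained generator amounts to disabling gradients before the transformer is trained:

```python
import torch

def freeze(module: torch.nn.Module) -> torch.nn.Module:
    # Disable gradient updates for the pre-trained image generator.
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()

# vqgan = freeze(load_pretrained_vqgan())      # e.g. an ImageNet checkpoint (hypothetical loader)
# with torch.no_grad():
#     codes = vqgan.encode_indices(frames)     # discrete code indices, one grid per frame
```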

4.2 Autoregressive latent video prediction model

To leverage the VQ-GAN model for video prediction, we utilize an autoregressive latent video prediction architecture that operates on top of the discrete codes. Specifically, we extract the discrete codes $z_q^t$ using the pre-trained VQ-GAN, where $z_q^t$ is the discrete code extracted from the frame $\mathbf{x}_t$ as described in Section 3.2. Then, instead of modelling the distribution of the video in the pixel-channel space as in (1), we learn the distribution of the video in the discrete latent representation space:

$$p(z_q) = \prod_{i=1}^{N} p\big(z_{q,\pi(i)} \mid z_{q,\pi(<i)}\big), \tag{6}$$

where $N$ is the total number of codes from the video and $\pi$ is a raster-scan ordering over the codes. Due to its simplicity, we utilize the causal transformer architecture (Yan et al., 2021), where the output logits from the input codes are trained to predict the next discrete codes.
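To make the training objective concrete, the sketch below (our illustration, not the authors' implementation; the paper uses Sparse Transformers, while a dense PyTorch encoder with a causal mask stands in here, and the vocabulary size and model width are placeholders) flattens the per-frame code grids, concatenates them over time, and trains the model with a next-code cross-entropy loss, matching (6).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPredictor(nn.Module):
    """Causal transformer over flattened VQ-GAN code indices (illustrative)."""
    def __init__(self, vocab_size=1024, dim=256, layers=12, heads=8, max_len=6400):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)      # code-index embedding
        self.pos = nn.Embedding(max_len, dim)         # learned positional embedding
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, codes):                         # codes: (B, N) flattened indices
        B, N = codes.shape
        h = self.tok(codes) + self.pos(torch.arange(N, device=codes.device))
        causal = torch.triu(torch.full((N, N), float('-inf'), device=codes.device),
                            diagonal=1)               # mask out future positions
        h = self.blocks(h, mask=causal)
        return self.head(h)                           # (B, N, vocab_size) next-code logits

def next_code_loss(model, codes):
    # Predict code i from all codes < i (conditioning frames included in the prefix).
    logits = model(codes[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           codes[:, 1:].reshape(-1))
```

At inference time, codes for the future frames are sampled one at a time from the logits and decoded back into frames with the frozen VQ-GAN decoder.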

4.3 Additional techniques

Top-k sampling.

To improve the video prediction quality of latent autoregressive models, whose outputs are sampled from a probability distribution over a large number of discrete codes, we utilize top-$k$ sampling (Fan et al., 2018), which randomly samples the output from the $k$ most probable discrete codes. By preventing the model from sampling rare discrete codes from the long tail of the probability distribution and then predicting future frames conditioned on such codes, we find that top-$k$ sampling improves video prediction quality, especially given that the number of discrete codes required for future prediction is very large, e.g., 2,560 on RoboNet (Dasari et al., 2019) and up to 6,400 on the KITTI dataset (Geiger et al., 2013) in our experimental setup.
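A minimal top-$k$ sampling routine over the code vocabulary might look as follows (a generic sketch, not tied to a particular codebase):

```python
import torch

def top_k_sample(logits, k=10):
    # logits: (B, vocab_size) unnormalized scores for the next discrete code
    topk_vals, topk_idx = torch.topk(logits, k, dim=1)
    masked = torch.full_like(logits, float('-inf'))
    masked.scatter_(1, topk_idx, topk_vals)           # keep only the k largest logits
    probs = torch.softmax(masked, dim=1)              # renormalize over the top-k codes
    return torch.multinomial(probs, num_samples=1)    # (B, 1) sampled code index
```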

Data augmentation.

We also investigate how data augmentation can be useful for improving the performance of autoregressive latent video prediction models. Since the image generator is not trained with augmentation, we utilize a weak augmentation to avoid the instability caused by aggressive transformations of the input frames, i.e., a translation augmentation that moves the input frames by a few pixels along the X or Y direction.
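The translation augmentation can be sketched as below (our illustration; zero-padding the exposed border and shifting both axes by the same random amount are our assumptions, not details stated above):

```python
import random
import torch

def random_translate(video, max_shift=4):
    # video: (T, C, H, W); the same shift is applied to all frames of the clip
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    shifted = torch.roll(video, shifts=(dy, dx), dims=(2, 3))
    # zero out pixels that wrapped around so the roll behaves like a translation
    if dy > 0:   shifted[:, :, :dy, :] = 0
    elif dy < 0: shifted[:, :, dy:, :] = 0
    if dx > 0:   shifted[:, :, :, :dx] = 0
    elif dx < 0: shifted[:, :, :, dx:] = 0
    return shifted
```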

5 Experiments

We design our experiments to investigate the following:

  • Can HARP predict high-resolution future frames (up to 256x256 pixels) on various video datasets with different characteristics?

  • How does HARP compare to state-of-the-art methods with large end-to-end networks on standard video prediction benchmarks in terms of quantitative evaluation?

  • How do the proposed techniques affect the performance of HARP?

Figure 3: Future frames predicted by HARP trained on (a) RoboNet (Dasari et al., 2019) and (b) Kinetics-600 (Carreira et al., 2018).

(a) BAIR Robot Pushing
Method              Params   FVD (↓)
LVT                 50M      125.8
SAVP                53M      116.4
DVD-GAN-FP          N/A      109.8
VideoGPT            82M      103.3
TrIVD-GAN-FP        N/A      103.3
Video Transformer   373M     94.0
FitVid              302M     93.6
HARP (ours)         89M      99.3

(b) KITTI
Method        Params   FVD (↓)   LPIPS (↓)
SVG           298M     1217.3    0.327
GHVAE         599M     552.9     0.286
FitVid        302M     884.5     0.217
HARP (ours)   89M      482.9     0.191

Baselines are SVG (Villegas et al., 2019), GHVAE (Wu et al., 2021a), FitVid (Babaeizadeh et al., 2021), LVT (Rakhimov et al., 2021), SAVP (Lee et al., 2018), DVD-GAN-FP (Clark et al., 2019), VideoGPT (Yan et al., 2021), TrIVD-GAN-FP (Luc et al., 2020), and Video Transformer (Weissenborn et al., 2020). N/A: parameter count not available.

Table 1: Quantitative evaluation on (a) BAIR Robot Pushing (Ebert et al., 2017) and (b) KITTI driving dataset (Geiger et al., 2013). We observe that HARP can achieve competitive performance to state-of-the-art methods with large end-to-end networks on these benchmarks.

5.1 High-resolution video prediction

Implementation.

We utilize up to 8 Nvidia 2080Ti GPUs and 20 CPU cores for training each model. For training VQ-GAN (Esser et al., 2021), we first train the model without the discriminator loss and then continue training with it, following the suggestion of the authors. For all experiments, VQ-GAN downsamples each frame into a 16x16 grid of latent codes, i.e., by a factor of 4 for 64x64 frames and by a factor of 16 for 256x256 frames. For training the transformer model, the VQ-GAN model is frozen so that its parameters are not updated. We use Sparse Transformers (Child et al., 2019) as our transformer architecture to accelerate training.

Setup.

For the high-resolution experiments, we use top-$k$ sampling at inference time but no data augmentation. We investigate how our model works on the large-scale real-world RoboNet dataset (Dasari et al., 2019), consisting of more than 15 million frames, and the Kinetics-600 dataset, consisting of more than 400,000 videos, both of which require a large amount of computing resources for training even at low resolution (Babaeizadeh et al., 2021; Clark et al., 2019). For RoboNet, we first train a VQ-GAN model and then train a 12-layer causal transformer that predicts ten future frames conditioned on the first two frames and the future ten actions. For Kinetics-600, to avoid the prohibitively expensive training cost of high-resolution video prediction models on this dataset and to fully exploit the benefit of employing a high-fidelity image generator, we utilize the ImageNet pre-trained VQ-GAN model. As we train the transformer model only for autoregressive prediction, this enables us to train a video prediction model in a very efficient manner.

Results.

First, we provide the predicted frames on a held-out test video of the RoboNet dataset in Figure 3, where the model predicts high-resolution future frames of a robot arm moving around various objects of different colors and shapes. Furthermore, Figure 3 shows that the Kinetics-600 pre-trained model can also predict future frames on test natural videos (videos with a CC-BY license: Figure 3, top and bottom), which demonstrates that leveraging a large image generator pre-trained on a wide range of natural images can be a promising recipe for efficient video prediction on large-scale video datasets.

5.2 Comparative evaluation on standard benchmarks

Setup.

For quantitative evaluation, we first consider the BAIR Robot Pushing dataset (Ebert et al., 2017), consisting of roughly 40k training and 256 test videos. Following the setup in prior work (Yan et al., 2021), we predict 15 future frames conditioned on one frame. We also evaluate our method on the KITTI driving dataset (Geiger et al., 2013), where the training and test sets are split by following the setup in Villegas et al. (2019). Specifically, the test set consists of 148 video clips constructed by extracting 30-frame clips and skipping every 5 frames, and the model is trained to predict ten future frames conditioned on five frames and evaluated on predicting 25 future frames conditioned on five frames. For hyperparameters, we use $k$ = 10 for both datasets, and data augmentation with a translation of 4 pixels is applied only to KITTI, as there was no sign of overfitting on the BAIR dataset. For evaluation metrics, we use LPIPS (Zhang et al., 2018) and FVD (Unterthiner et al., 2018), computed using 100 future videos for each ground-truth test video; we report the best score over the 100 videos for LPIPS and use all videos for FVD, following Babaeizadeh et al. (2021) and Villegas et al. (2019).
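As an illustration of this protocol, the sketch below (assuming a frame-wise perceptual distance such as the one provided by the `lpips` package) computes the best LPIPS over the sampled futures for one test video; FVD would instead be computed over all samples and test videos together.

```python
import torch

def best_lpips(lpips_fn, samples, ground_truth):
    # samples: (S, T, C, H, W) predicted futures, ground_truth: (T, C, H, W)
    scores = []
    for sample in samples:                            # average LPIPS over frames
        per_frame = lpips_fn(sample, ground_truth)    # per-frame perceptual distances
        scores.append(per_frame.mean())
    return torch.stack(scores).min()                  # best (lowest) score over samples
```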

(a) Effect of top-k sampling
Dataset   k          FVD (↓)
BAIR      No top-k   104.4
BAIR      100        103.6
BAIR      10         99.3
KITTI     No top-k   578.1
KITTI     100        557.7
KITTI     10         482.9

(b) Effect of the number of layers
Dataset   Layers   FVD (↓)
BAIR      6        111.8
BAIR      12       99.3
KITTI     6        520.1
KITTI     12       482.9

(c) Effect of the augmentation magnitude (pixels)
Dataset   Magnitude   FVD (↓)
KITTI     0           980.1
KITTI     2           497.0
KITTI     4           482.9
KITTI     8           523.4

Table 2: FVD scores of HARP when varying (a) the number of codes k used for top-k sampling, (b) the number of transformer layers, and (c) the magnitude of the translation augmentation.

Results.

Table 1 shows the performance of our method and baselines on the test sets of BAIR Robot Pushing and the KITTI driving dataset. We observe that our model achieves competitive or superior performance to state-of-the-art methods with large end-to-end networks, e.g., HARP outperforms FitVid, which has 302M parameters, on the KITTI driving dataset. Our model also successfully extrapolates to an unseen number of future frames on KITTI (i.e., 25 instead of the 10 used during training), which implies that transformer-based video prediction models can predict an arbitrary number of frames at inference time. In the case of BAIR, HARP achieves performance similar to FitVid with 302M parameters, even though our method only requires 89M parameters.

Analysis.

We investigate how top-$k$ sampling, the number of layers, and the magnitude of data augmentation affect performance. Table 2(a) shows that a smaller $k$ leads to better performance, implying that top-$k$ sampling is effective at improving performance by discarding rare discrete codes that might degrade prediction quality at inference time. As shown in Table 2(b), more layers lead to better performance on the BAIR dataset, which implies our model can be further improved by scaling up the networks. Finally, Table 2(c) shows that (i) data augmentation on KITTI is important for achieving strong performance, similar to the observation of Babaeizadeh et al. (2021), and (ii) overly aggressive augmentation leads to worse performance.

6 Discussion

In this work, we presented HARP, which employs a high-fidelity image generator to predict high-resolution future frames and achieves competitive performance with state-of-the-art video prediction methods that use large end-to-end networks. We also demonstrated that HARP can leverage an image generator pre-trained on a wide range of natural images for video prediction, similar to the approach taken in the context of video synthesis (Tian et al., 2021). We hope this work inspires further investigation into leveraging recently developed pre-trained image generators (Oord et al., 2017; Chen et al., 2020; Esser et al., 2021) for high-fidelity video prediction.

Figure 4: Failure cases in our experiments. (a) RoboNet: interaction with the objects is ignored. (b) Kinetics-600: the model repeats the first frame while a person is moving right in the ground-truth frames.

Finally, we report failure cases of video prediction with HARP and discuss possible extensions to resolve them. A common failure case on the RoboNet dataset is ignoring the interaction between the robot arm and objects; for example, in Figure 4, our model ignores the objects and only predicts the movement of the robot arm. On the other hand, a common failure case on Kinetics-600 is degenerate video prediction, where the model simply repeats the conditioning frame without predicting the future, as shown in Figure 4. These failure cases might be resolved by training larger networks, similar to observations in natural language processing, e.g., GPT-3 (Brown et al., 2020), or might necessitate a new architecture that addresses the complexity of training autoregressive latent prediction models on video datasets.

7 Acknowledgements

We would like to thank Jongjin Park, Wilson Yan, and Sihyun Yu for helpful discussions. We also thank Cirrascale Cloud Services (https://cirrascale.com) for providing compute resources. This work is supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-02068, Artificial Intelligence Innovation Hub; No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), the Center for Human Compatible AI (CHAI), ONR N00014-21-1-2769, the DARPA RACER program, the Hong Kong Centre for Logistics Robotics, and BMW.

References

  • S. Aigner and M. Körner (2018) Futuregan: anticipating the future frames of video sequences using spatio-temporal 3d convolutions in progressively growing gans. arXiv preprint arXiv:1810.01325. Cited by: §2.
  • M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine (2018) Stochastic variational video prediction. In International Conference on Learning Representations, Cited by: §1, §2.
  • M. Babaeizadeh, M. T. Saffar, S. Nair, S. Levine, C. Finn, and D. Erhan (2021) FitVid: overfitting in pixel-level video prediction. arXiv preprint arXiv:2106.13195. Cited by: §1, §2, §5.1, §5.2, §5.2, footnote b.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §6.
  • J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman (2018) A short note about kinetics-600. arXiv preprint arXiv:1808.01340. Cited by: item , Figure 3.
  • M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever (2020) Generative pretraining from pixels. In International Conference on Machine Learning, Cited by: §1, §1, §6.
  • X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel (2018) Pixelsnail: an improved autoregressive generative model. In International Conference on Machine Learning, Cited by: §2.
  • R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. Cited by: §2, §5.1.
  • A. Clark, J. Donahue, and K. Simonyan (2019) Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571. Cited by: §1, §2, §5.1, footnote b.
  • S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn (2019) Robonet: large-scale multi-robot learning. In Conference on Robot Learning, Cited by: Figure 1, item , §4.3, Figure 3, §5.1.
  • E. Denton and R. Fergus (2018) Stochastic video generation with a learned prior. In International Conference on Machine Learning, Cited by: §1, §2.
  • F. Ebert, C. Finn, A. X. Lee, and S. Levine (2017) Self-supervised visual planning with temporal skip connections.. In Conference on Robot Learning, Cited by: item , §5.2, Table 1.
  • P. Esser, R. Rombach, and B. Ommer (2021) Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.3, §4.1, §5.1, §6.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Cited by: §4.3.
  • C. Finn, I. Goodfellow, and S. Levine (2016) Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, Cited by: §1, §2.
  • C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), Cited by: §1.
  • A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research. Cited by: item , §4.3, §5.2, Table 1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Cited by: §2.
  • D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019) Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, Cited by: §2.
  • D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba (2021) Mastering atari with discrete world models. In International Conference on Learning Representations, Cited by: §2.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §3.3.
  • Y. Jang, G. Kim, and Y. Song (2018) Video prediction with appearance and motion conditions. In International Conference on Machine Learning, Cited by: §2.
  • L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. (2020) Model-based reinforcement learning for atari. In International Conference on Learning Representations, Cited by: §2.
  • D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, Cited by: §1.
  • N. Kalchbrenner, A. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu (2017) Video pixel networks. In International Conference on Machine Learning, Cited by: §2.
  • S. W. Kim, J. Philion, A. Torralba, and S. Fidler (2021) DriveGAN: towards a controllable high-quality neural simulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • S. W. Kim, Y. Zhou, J. Philion, A. Torralba, and S. Fidler (2020) Learning to simulate dynamic environments with gamegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • Y. Kwon and M. Park (2019) Predicting future frames using retrospective cycle gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine (2018) Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523. Cited by: §1, §2, footnote b.
  • W. Lotter, G. Kreiman, and D. Cox (2017) Deep predictive coding networks for video prediction and unsupervised learning. In International Conference on Learning Representations, Cited by: §2.
  • P. Luc, A. Clark, S. Dieleman, D. d. L. Casas, Y. Doron, A. Cassirer, and K. Simonyan (2020) Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035. Cited by: §1, §2, footnote b.
  • V. Michalski, R. Memisevic, and K. Konda (2014) Modeling deep temporal dependencies with recurrent grammar cells. In Advances in Neural Information Processing Systems, Cited by: §2.
  • J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh (2015) Action-conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems, Cited by: §2.
  • A. v. d. Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, Cited by: §1, §2, §3.2, §6.
  • R. Rakhimov, D. Volkhonskiy, A. Artemov, D. Zorin, and E. Burnaev (2021) Latent video transformer. In International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Cited by: §1, §2, footnote b.
  • M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra (2014) Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604. Cited by: §2.
  • A. Razavi, A. van den Oord, and O. Vinyals (2019) Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems, Cited by: §2.
  • S. Reed, A. Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. Freitas (2017) Parallel multiscale autoregressive density estimation. In International Conference on Machine Learning, Cited by: §2.
  • O. Rybkin, C. Zhu, A. Nagabandi, K. Daniilidis, I. Mordatch, and S. Levine (2021) Model-based reinforcement learning via latent-space collocation. In International Conference on Machine Learning, Cited by: §2.
  • Y. Seo, D. Hafner, H. Liu, F. Liu, S. James, K. Lee, and P. Abbeel (2022a) Masked world models for visual control. arXiv preprint arXiv:2206.14244. Cited by: §2.
  • Y. Seo, K. Lee, S. L. James, and P. Abbeel (2022b) Reinforcement learning with action-free pre-training from videos. In International Conference on Machine Learning, Cited by: §2.
  • I. Skorokhodov, S. Tulyakov, and M. Elhoseiny (2022) Stylegan-v: a continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using lstms. In International Conference on Machine Learning, Cited by: §2.
  • Y. Tian, J. Ren, M. Chai, K. Olszewski, X. Peng, D. N. Metaxas, and S. Tulyakov (2021) A good image generator is what you need for high-resolution video synthesis. In International Conference on Learning Representations, Cited by: §6.
  • T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly (2018) Towards accurate generative models of video: a new metric & challenges. arXiv preprint arXiv:1812.01717. Cited by: §5.2.
  • R. Villegas, A. Pathak, H. Kannan, D. Erhan, Q. V. Le, and H. Lee (2019) High fidelity video prediction with large stochastic recurrent neural networks. In Advances in Neural Information Processing Systems, Cited by: §1, §2, §5.2, footnote b.
  • C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, Cited by: §2.
  • J. Walker, A. Razavi, and A. v. d. Oord (2021) Predicting video with vqvae. arXiv preprint arXiv:2103.01950. Cited by: §2.
  • D. Weissenborn, O. Täckström, and J. Uszkoreit (2020) Scaling autoregressive video models. In International Conference on Learning Representations, Cited by: §1, §1, §2, §2, §3.1, footnote b.
  • B. Wu, S. Nair, R. Martin-Martin, L. Fei-Fei, and C. Finn (2021a) Greedy hierarchical variational autoencoders for large-scale video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §2, footnote b.
  • C. Wu, L. Huang, Q. Zhang, B. Li, L. Ji, F. Yang, G. Sapiro, and N. Duan (2021b) GODIVA: generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806. Cited by: §2.
  • H. Xu, Y. Gao, F. Yu, and T. Darrell (2017) End-to-end learning of driving models from large-scale video datasets. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §1.
  • W. Yan, R. Okumura, S. James, and P. Abbeel (2022) Patch-based object-centric transformers for efficient video generation. arXiv preprint arXiv:2206.04003. Cited by: §1, §2.
  • W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas (2021) VideoGPT: video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157. Cited by: §1, §2, §4.1, §4.2, §5.2, footnote b.
  • S. Yu, J. Tack, S. Mo, H. Kim, J. Kim, J. Ha, and J. Shin (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In International Conference on Learning Representations, Cited by: §2.
  • T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine (2020) Meta-world: a benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, Cited by: item .
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §3.3, §5.2.