Video (language) modeling: a baseline for generative models of natural videos

by Marc'Aurelio Ranzato, et al.
New York University

We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations that are useful for representing complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature and adapted to the vision domain by quantizing the space of image patches into a large dictionary. We demonstrate the approach on both a filling and a generation task. For the first time, we show that, after training on natural videos, such a model can predict non-trivial motions over short video sequences.
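The core idea in the abstract, quantizing image patches into a discrete dictionary and then modeling the resulting token sequence with language-modeling tools, can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy, not the paper's actual model: it uses a plain Lloyd's k-means to build the patch dictionary and a smoothed bigram count model as the sequence predictor, whereas the paper's predictors are more sophisticated. All function names (`kmeans`, `quantize`, `predict_next`) and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means; the centroids act as the patch 'dictionary'."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # squared Euclidean distance from every patch to every centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(0)
    return centroids

def quantize(patches, centroids):
    """Map each patch to the index of its nearest dictionary atom (its 'word')."""
    d = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

# Synthetic stand-in for a video: T patches of 8x8 pixels, flattened to 64-dim.
T, dim, k = 500, 64, 32
patches = rng.normal(size=(T, dim))

dictionary = kmeans(patches, k)
tokens = quantize(patches, dictionary)  # video -> sequence of discrete tokens

# A bigram "language model" over patch tokens with Laplace smoothing:
# counts[a, b] ~ how often token b follows token a in the training video.
counts = np.ones((k, k))
for a, b in zip(tokens[:-1], tokens[1:]):
    counts[a, b] += 1
probs = counts / counts.sum(1, keepdims=True)

def predict_next(token):
    """Most likely next patch index given the current one (frame extrapolation)."""
    return int(probs[token].argmax())
```

Decoding a predicted token back to pixels just means emitting the corresponding centroid, which is why a large dictionary is needed to keep the reconstructions from looking blocky.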

Predicting 3D Human Dynamics from Video

Given a video of a person in action, we can easily guess the 3D future m...

Learning Energy-based Spatial-Temporal Generative ConvNets for Dynamic Patterns

Video sequences contain rich dynamic patterns, such as dynamic texture p...

VideoFlow: A Flow-Based Generative Model for Video

Generative models that can model and predict sequences of future events ...

Semi-Parametric Video-Grounded Text Generation

Efficient video-language modeling should consider the computational cost...

Signs in time: Encoding human motion as a temporal image

The goal of this work is to recognise and localise short temporal signal...

Photo-Realistic Video Prediction on Natural Videos of Largely Changing Frames

Recent advances in deep learning have significantly improved performance...

Generative Models for Low-Rank Video Representation and Reconstruction

Finding compact representation of videos is an essential component in al...