1 Introduction
Most existing models for video prediction employ a hybrid of convolutional and recurrent layers as the underlying architecture (wang2017predrnn; xingjian2015convolutional; lotter2016deep). Such a design enables the model to simultaneously exploit the ability of convolutional units to model spatial relationships and the capacity of recurrent units to capture temporal dependencies. Despite their prevalence in the literature, classical video prediction architectures suffer from one major limitation. In dense prediction tasks such as video prediction, models must make pixel-wise predictions, which heightens the demand for preserving information through layers. Prior works attempt to address this demand through the extensive use of resolution-preserving blocks (wang2017predrnn; wang2018predrnn++; kalchbrenner2016video). Nevertheless, these resolution-preserving blocks are not guaranteed to retain all the relevant information, and they greatly increase the memory consumption and computational cost of the models.
Recently, reversible architectures (dinh2014nice; gomez2017reversible; jacobsen2018revnet) have attracted attention due to their light memory demand and their information-preserving property by design. However, the effectiveness of reversible models remains largely unexplored in the video literature. In this extended abstract, we introduce CrevNet, a novel, conditionally reversible video prediction model: conditioned on the previous hidden states, it can exactly reconstruct its input from its predictions. The contributions of this work can be summarized as follows:

We introduce a two-way autoencoder that uses the forward and backward passes of an invertible network as the encoder and decoder. The volume-preserving two-way autoencoder not only greatly reduces memory demand and computational cost, but also enjoys a theoretically guaranteed absence of information loss.

We propose the reversible predictive module (RPM), which extends reversibility from the spatial to the temporal domain. The RPM, together with the two-way autoencoder, provides a conditionally reversible architecture (CrevNet) for spatiotemporal learning.
2 Approach
We first outline the general pipeline of our method. Our CrevNet consists of two subnetworks: an autoencoder network with an encoder $f$ and decoder $g$, and a recurrent predictor $P$ bridging the encoder and decoder. Let $x_t \in \mathbb{R}^{w \times h \times c}$ represent the $t$-th frame in video $x$, where $w$, $h$ and $c$ denote its width, height and number of channels. Given $x_t$, the model predicts the next frame $\hat{x}_{t+1}$ as follows:
$\hat{x}_{t+1} = g(P(f(x_t)))$ (1)
During multi-frame generation without access to the ground-truth frames, the model uses its previous predictions as input instead.
2.1 The Invertible Two-way Autoencoder
We propose a bijective two-way autoencoder based on the additive coupling layer introduced in NICE (dinh2014nice). We begin by describing the building block of the two-way autoencoder (Fig 1a). Formally, the input $x$ is first reshaped and split channel-wise into two groups, denoted $x_1$ and $x_2$. During the forward pass of each building block, one group, e.g. $x_1$, passes through several convolutions and activations and is then added to the other group, $x_2$, like a residual block:
$\hat{x}_1 = x_1, \qquad \hat{x}_2 = x_2 + \mathcal{F}(x_1)$ (2)
where $\mathcal{F}$ is a composite nonlinear operator consisting of convolutions and activations, and $\hat{x}_1$ and $\hat{x}_2$ are the updated $x_1$ and $x_2$. Note that $x_1$ and $x_2$ can be simply recovered from $\hat{x}_1$ and $\hat{x}_2$ by the inverse computation (Fig 1c) as follows:
$x_1 = \hat{x}_1, \qquad x_2 = \hat{x}_2 - \mathcal{F}(\hat{x}_1)$ (3)
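As a minimal illustration, the forward and inverse computations of one coupling block can be sketched as follows. This is a toy version: the function `F` below is a hypothetical stand-in for the convolution/activation stack, and features are plain 1-D lists rather than feature maps.

```python
# Sketch of one additive coupling block (Eqs. 2-3); F is an arbitrary
# nonlinear transform standing in for several conv + activation layers.
def F(x):
    return [v * v + 1.0 for v in x]

def coupling_forward(x1, x2):
    # x̂1 = x1,  x̂2 = x2 + F(x1)
    return x1, [a + b for a, b in zip(x2, F(x1))]

def coupling_inverse(y1, y2):
    # x1 = x̂1,  x2 = x̂2 - F(x̂1)
    return y1, [a - b for a, b in zip(y2, F(y1))]

x1, x2 = [0.5, -1.0], [2.0, 3.0]
y1, y2 = coupling_forward(x1, x2)
assert coupling_inverse(y1, y2) == (x1, x2)  # exact reconstruction
```

Because the inverse only subtracts the very same quantity the forward pass added, the reconstruction is exact, whatever `F` computes.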
Multiple building blocks are stacked in an alternating fashion between $x_1$ and $x_2$ to construct a two-way autoencoder, as shown in Fig 1a. The series of forward and inverse computations builds a one-to-one and onto, i.e. bijective, mapping between the input and the features. Such invertibility ensures that there is no information loss during feature extraction, which is presumably more favorable for video prediction since the model is expected to restore future frames with fine-grained details. To enable the invertibility of the entire autoencoder, our two-way autoencoder uses a bijective downsampling, the pixel shuffle layer (shi2016real), which changes the shape of a feature map from $w \times h \times c$ to $\frac{w}{2} \times \frac{h}{2} \times 4c$. The resulting volume-preserving architecture greatly reduces memory consumption compared with existing resolution-preserving methods.
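The bijective downsampling can be sketched as a space-to-depth rearrangement. The pure-Python version below is illustrative only (it assumes even $w$ and $h$ and nested-list "tensors"), not an optimized implementation:

```python
# Volume-preserving downsampling: a w×h×c feature map (nested lists) is
# rearranged into w/2 × h/2 × 4c. No values are created or discarded, so
# the operation is trivially invertible.
def space_to_depth(x):
    w, h, c = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * (4 * c) for _ in range(h // 2)] for _ in range(w // 2)]
    for i in range(w):
        for j in range(h):
            sub = 2 * (i % 2) + (j % 2)  # which of the 4 spatial sub-positions
            for k in range(c):
                out[i // 2][j // 2][sub * c + k] = x[i][j][k]
    return out

x = [[[float(10 * i + j)] for j in range(4)] for i in range(4)]  # 4×4×1
y = space_to_depth(x)
assert len(y) == 2 and len(y[0]) == 2 and len(y[0][0]) == 4      # 2×2×4
```

Each output position simply collects the four neighboring input pixels into channels, which is why the total volume $w \cdot h \cdot c$ is unchanged.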
We further argue that for generative tasks, e.g. video prediction, we can effectively utilize a single two-way autoencoder, using its forward and backward passes as the encoder and the decoder, respectively. The predicted frame $\hat{x}_{t+1}$ is thus given by
$\hat{x}_{t+1} = f^{-1}(P(f(x_t)))$ (4)
where $f^{-1}$ is the backward pass of $f$. Our rationale is that such a setting not only reduces the number of parameters in the model, but also encourages the model to explore a shared feature space between the inputs and the targets. As a result, our method does not require any form of information sharing, e.g. skip connections, between the encoder and decoder. In addition, our two-way autoencoder enjoys a lower computational cost during the multi-frame prediction phase, where the encoding pass is no longer needed and the predictor directly takes the output from the previous timestep as input, since $f(f^{-1}(\cdot))$ is an identity mapping.
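To make the shared-encoder idea concrete, here is a hedged sketch (again with a toy stand-in operator and plain lists) showing that a stack of alternating coupling blocks yields an exactly invertible $f$, so its backward pass can serve as the decoder:

```python
def F(x):
    # hypothetical stand-in for the convolutional operator
    return [v * v + 1.0 for v in x]

def encode(x1, x2, n_blocks=3):
    # forward pass f: alternate which group receives the residual update
    for _ in range(n_blocks):
        x2 = [a + b for a, b in zip(x2, F(x1))]
        x1, x2 = x2, x1
    return x1, x2

def decode(y1, y2, n_blocks=3):
    # backward pass f^{-1}: undo the swaps and additions in reverse order
    for _ in range(n_blocks):
        y1, y2 = y2, y1
        y2 = [a - b for a, b in zip(y2, F(y1))]
    return y1, y2

x1, x2 = [0.5, -1.0], [2.0, 3.0]
assert decode(*encode(x1, x2)) == (x1, x2)  # f^{-1}(f(x)) = x
```

In the full model, the predictor would operate on `encode`'s output, and `decode` would map the predicted features back to pixels; no skip connections are needed because no information is lost on the way in.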
2.2 Reversible Predictive Module
In this section, we describe the second part of our video prediction model, the predictor $P$, which computes dependencies along both the space and time dimensions. Although the traditional stacked-ConvRNN architecture is the most straightforward choice of predictor, we find experimentally that it fails to establish a consistent temporal dependency when equipped with our two-way autoencoder. Therefore, we propose a novel reversible predictive module (RPM), which can be regarded as a recurrent extension of the two-way autoencoder. In the RPM, we substitute all standard convolutions with layers from the ConvRNN family (e.g. ConvLSTM or spatiotemporal LSTM) and introduce a soft attention mechanism (weighting gates) to form a weighted sum of the two groups instead of a direct addition. The main operations of the RPM used in this paper are as follows:
ConvRNN: $h_t^1 = \mathrm{ConvRNN}(x_t^1, h_{t-1}^1)$
Attention module: $g_t = \sigma(W \ast h_t^1)$
Weighted sum: $\hat{x}_t^2 = g_t \odot h_t^1 + (1 - g_t) \odot x_t^2$
where $x_t^1$ and $x_t^2$ denote the two groups of features at timestep $t$, $h_t^1$ denotes the hidden state of the ConvRNN layer, $\sigma$ is the sigmoid activation, $\ast$ is the standard convolution operator and $\odot$ is the Hadamard product. The architecture of the reversible predictive module is shown in Fig 1b. The RPM adopts a similar architectural design to the two-way autoencoder to ensure a pixel-wise alignment between input and output, i.e. each position of the features can be traced back to a certain pixel, which makes it compatible with our two-way autoencoder. It also mitigates vanishing-gradient issues across stacked layers, since the coupling layer provides a nice property w.r.t. the Jacobian (dinh2014nice). In addition, the attention mechanism in the RPM enables the model to focus on objects in motion instead of the background, which further improves video prediction quality. Similarly, multiple RPMs alternate between the two groups to form a predictor. We call this predictor conditionally reversible since, given the hidden states $h_{t-1}$, we are able to reconstruct $x_t$ from $\hat{x}_{t+1}$ if there are no numerical errors:
$x_t = f^{-1}(P^{-1}(f(\hat{x}_{t+1})))$ (5)
where $P^{-1}$ is the inverse computation of the predictor $P$. We name the video prediction model that uses the two-way autoencoder as its backbone and RPMs as its predictor CrevNet. Another key factor of the RPM is the choice of ConvRNN. In this paper, we mainly employ ConvLSTM (xingjian2015convolutional) and the spatiotemporal LSTM (ST-LSTM, wang2017predrnn).
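The gated update of the RPM, and its conditional inversion given the hidden state, can be sketched as follows. This is a deliberately simplified scalar-per-position version: `conv_rnn` is a toy stand-in for a ConvLSTM/ST-LSTM layer, and the single weight `w` stands in for the learned attention convolution.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def conv_rnn(x1, h_prev):
    # toy recurrent cell standing in for ConvLSTM / ST-LSTM
    return [math.tanh(a + 0.5 * b) for a, b in zip(x1, h_prev)]

def rpm_step(x1, x2, h_prev, w=1.0):
    h = conv_rnn(x1, h_prev)                 # h_t^1 = ConvRNN(x_t^1, h_{t-1}^1)
    g = [sigmoid(w * v) for v in h]          # g_t: soft attention gate
    x2_new = [gi * hi + (1.0 - gi) * xi      # weighted sum of the two groups
              for gi, hi, xi in zip(g, h, x2)]
    return x2_new, h

def rpm_invert(x2_new, x1, h_prev, w=1.0):
    # conditionally reversible: given the same x1 and hidden state,
    # recompute h and g, then solve the weighted sum for x2
    h = conv_rnn(x1, h_prev)
    g = [sigmoid(w * v) for v in h]
    return [(yi - gi * hi) / (1.0 - gi) for yi, gi, hi in zip(x2_new, g, h)]

x1, x2, h0 = [0.3, -0.7], [1.0, 2.0], [0.0, 0.0]
y, h1 = rpm_step(x1, x2, h0)
rec = rpm_invert(y, x1, h0)
assert all(abs(a - b) < 1e-9 for a, b in zip(rec, x2))
```

Since the sigmoid gate lies strictly in $(0, 1)$, the weighted sum can always be solved for $x_t^2$, which is exactly the conditional reversibility the module is named for (up to floating-point error).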
3 Experiment
We evaluate our model on a complicated real-world dataset, Traffic4cast, which records the traffic statuses of 3 big cities over a year at 5-minute intervals. By its spatiotemporal nature, traffic forecasting can be straightforwardly cast as a video prediction task. However, this dataset is quite challenging for the following reasons. (1) High resolution: the frame resolution of Traffic4cast is 495 × 436, the highest among all datasets. Existing resolution-preserving methods can hardly be adapted to this dataset, since they all require extremely large memory and computation. Even if these models could be fitted into GPUs, they would still not have large enough receptive fields to capture the meaningful dynamics, as vehicles can move up to 100 pixels between consecutive frames. (2) Complicated nonlinear dynamics: valid data points reside only on the hidden roadmap of each city, which is not explicitly provided in this dataset. Vehicles moving on these curved roads, together with tangled road conditions, produce very complex nonlinear behaviours. The data also involve many unobservable conditions or random events, such as weather and car accidents.
Datasets and Setup: Each frame in the Traffic4cast dataset is a 495 × 436 × 3 heatmap, where the last dimension records 3 traffic statuses representing volume, mean speed and major direction at a given location. The general architecture of CrevNet is composed of a 36-layer two-way autoencoder and 8 RPMs. All variants of CrevNet are trained using the Adam optimizer to minimize MSE. We train each model to predict the next 3 frames (the next 15 minutes) from 9 observations and evaluate predictions with the MSE criterion. Moreover, we found two additional tricks, finetuning on each city and finetuning on each timestep, that further improve the overall results.
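For reference, the per-pixel MSE criterion we report can be sketched as follows (assumption: the squared error averaged over all pixel values of all predicted frames):

```python
# Toy per-pixel MSE over a sequence of predicted frames; each frame is a
# flat list of pixel values (a real implementation would use tensors).
def per_pixel_mse(pred, target):
    total, n = 0.0, 0
    for pf, tf in zip(pred, target):
        for a, b in zip(pf, tf):
            total += (a - b) ** 2
            n += 1
    return total / n

assert per_pixel_mse([[0.0, 2.0]], [[0.0, 0.0]]) == 2.0  # (0 + 4) / 2
```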
Table 1: Per-pixel MSE of CrevNet on Traffic4cast: the single model, with finetuning on each city, with finetuning on each timestep, and the final model.
Results: The quantitative comparison is provided in Table 1. Unlike all previous state-of-the-art methods, CrevNet does not suffer from high memory consumption, so we were able to train our model on a single V100 GPU. The invertibility of the two-way autoencoder preserves all the information necessary for spatiotemporal learning and allows our model to generate sharp and reasonable predictions.