Most of the existing models for video prediction employ a hybrid of convolutional and recurrent layers as the underlying architecture (wang2017predrnn; xingjian2015convolutional; lotter2016deep). Such architectural design enables the model to simultaneously exploit the ability of convolutional units to model spatial relationships and the potential of recurrent units to capture temporal dependencies. Despite their prevalence in the literature, classical video prediction architectures suffer from one major limitation. In dense prediction tasks such as video prediction, models are required to make pixel-wise predictions, which emphasizes the demand for the preservation of information through layers. Prior works attempt to address such demand through the extensive use of resolution-preserving blocks (wang2017predrnn; wang2018predrnn++; kalchbrenner2016video). Nevertheless, these resolution-preserving blocks are not guaranteed to preserve all the relevant information, and they greatly increase the memory consumption and computational cost of the models.
Recently, reversible architectures (dinh2014nice; gomez2017reversible; jacobsen2018revnet) have attracted attention due to their light memory demand and because they preserve information by design. However, the effectiveness of reversible models remains largely unexplored in the video literature. In this extended abstract, we introduce CrevNet, a novel video prediction model that is conditionally reversible in the sense that, when conditioned on previous hidden states, it can exactly reconstruct the input from its predictions. The contributions of this work can be summarized as follows:
We introduce a two-way autoencoder that uses the forward and backward passes of an invertible network as encoder and decoder. The volume-preserving two-way autoencoder not only greatly reduces the memory demand and computational cost, but also enjoys the theoretically guaranteed property of no information loss.
We propose the reversible predictive module (RPM), which extends reversibility from the spatial to the temporal domain. RPM, together with the two-way autoencoder, provides a conditionally reversible architecture (CrevNet) for spatiotemporal learning.
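We first outline the general pipeline of our method. CrevNet consists of two subnetworks: an autoencoder network with an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$, and a recurrent predictor $\mathcal{P}$ bridging the encoder and decoder. Let $x_t \in \mathbb{R}^{w \times h \times c}$ represent the $t$-th frame in video $x$, where $w$, $h$, and $c$ denote its width, height, and the number of channels. Given $x_{0:t-1}$, the model predicts the next frame $\hat{x}_t$ as follows:
$$\hat{x}_t = \mathcal{D}(\mathcal{P}(\mathcal{E}(x_{t-1}) \mid x_{0:t-2}))$$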
During the multi-frame generation process without access to the ground truth frames, the model uses its previous predictions instead.
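The pipeline can be sketched as a short rollout loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `rollout`, `encode`, `predict`, `decode`, and the explicit `state` argument are hypothetical names, and the toy callables in the usage lines merely stand in for the real networks.

```python
import numpy as np


def rollout(encode, predict, decode, context, horizon):
    """Warm up on observed frames, then generate `horizon` frames.

    encode:  frame -> features
    predict: (features, state) -> (next features, new state)
    decode:  features -> frame
    All three are hypothetical callables standing in for the real networks.
    """
    if len(context) == 0:
        raise ValueError("at least one observed frame is required")
    state = None
    frame_hat = None
    # While ground truth is available, feed the observed frames.
    for frame in context:
        features, state = predict(encode(frame), state)
        frame_hat = decode(features)
    predictions = [frame_hat]
    # Without ground truth, feed the model its own previous prediction.
    while len(predictions) < horizon:
        features, state = predict(encode(predictions[-1]), state)
        predictions.append(decode(features))
    return predictions


# Toy usage: the "networks" below simply add 1 to every pixel per step.
context = [np.zeros((2, 2)), np.ones((2, 2))]
frames = rollout(lambda x: 2 * x, lambda z, s: (z + 2.0, s), lambda z: z / 2,
                 context, horizon=3)
```

The recurrent state is threaded through every call, so the prediction at each step is conditioned on all earlier frames even though only the latest frame is passed in explicitly.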
2.1 The Invertible Two-way Autoencoder
We propose a bijective two-way autoencoder based on the additive coupling layer introduced in NICE (dinh2014nice). We begin by describing the building block of the two-way autoencoder (Fig 1 a). Formally, the input $x$ is first reshaped and split channelwise into two groups, denoted as $x^1$ and $x^2$. During the forward pass of each building block, one group, e.g. $x^1$, passes through several convolutions and activations and is then added to the other group, $x^2$, like a residual block:
$$\hat{x}^2 = x^2 + \mathcal{F}_1(x^1), \qquad \hat{x}^1 = x^1 + \mathcal{F}_2(\hat{x}^2)$$
where each $\mathcal{F}_i$ is a composite non-linear operator consisting of convolutions and activations, and $\hat{x}^1$ and $\hat{x}^2$ are the updated $x^1$ and $x^2$. Note that $x^1$ and $x^2$ can be simply recovered from $\hat{x}^1$ and $\hat{x}^2$ by the inverse computation (Fig 1 c) as follows:
$$x^1 = \hat{x}^1 - \mathcal{F}_2(\hat{x}^2), \qquad x^2 = \hat{x}^2 - \mathcal{F}_1(x^1)$$
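The forward and inverse computations of one building block can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, and `f1` and `f2` are arbitrary stand-ins for the convolution-and-activation stacks. The point is that the inverse never needs to invert `f1` or `f2` themselves.

```python
import numpy as np


def coupling_forward(x1, x2, f1, f2):
    """Additive coupling: update x2 from x1, then x1 from the updated x2."""
    y2 = x2 + f1(x1)
    y1 = x1 + f2(y2)
    return y1, y2


def coupling_inverse(y1, y2, f1, f2):
    """Undo the two additions in reverse order."""
    x1 = y1 - f2(y2)
    x2 = y2 - f1(x1)
    return x1, x2


# Hypothetical stand-ins for the convolutional operators; neither is invertible.
f1 = lambda a: np.tanh(3.0 * a)
f2 = lambda a: np.maximum(0.5 * a - 0.1, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 6))   # (height, width, channels), channels even
x1, x2 = np.split(x, 2, axis=-1)     # channelwise split into two groups
y1, y2 = coupling_forward(x1, x2, f1, f2)
r1, r2 = coupling_inverse(y1, y2, f1, f2)
print(np.allclose(r1, x1) and np.allclose(r2, x2))
```

Because each step only adds a function of the other group, the block is bijective for any choice of `f1` and `f2`, up to floating-point rounding.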
Multiple building blocks are stacked in an alternating fashion between $x^1$ and $x^2$ to construct a two-way autoencoder, as shown in Fig 1 a. A series of the forward and inverse computations builds a one-to-one and onto, i.e. bijective, mapping between the input and features. Such invertibility ensures that there is no information loss during feature extraction, which is presumably more favorable for video prediction since the model is expected to restore the future frames with fine-grained details. To enable the invertibility of the entire autoencoder, our two-way autoencoder uses a bijective downsampling operation, the pixel shuffle layer (shi2016real), that changes the shape of the feature from $w \times h \times c$ to $\frac{w}{n} \times \frac{h}{n} \times n^2 c$, where $n$ is the downsampling factor. The resulting volume-preserving architecture can greatly reduce memory consumption compared with existing resolution-preserving methods.
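The bijective downsampling can be illustrated with a space-to-depth rearrangement. The sketch below assumes a `(height, width, channels)` array layout and a downsampling factor `n` that divides both spatial dimensions; the names `squeeze` and `unsqueeze` are hypothetical, and the exact channel ordering used by the authors may differ.

```python
import numpy as np


def squeeze(x, n):
    """Move each n x n spatial patch into the channel axis (no values are lost)."""
    h, w, c = x.shape
    if h % n or w % n:
        raise ValueError("spatial dimensions must be divisible by n")
    x = x.reshape(h // n, n, w // n, n, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (h/n, w/n, n, n, c)
    return x.reshape(h // n, w // n, n * n * c)


def unsqueeze(z, n):
    """Exact inverse of `squeeze`."""
    hn, wn, cn = z.shape
    c = cn // (n * n)
    z = z.reshape(hn, wn, n, n, c)
    z = z.transpose(0, 2, 1, 3, 4)            # (h/n, n, w/n, n, c)
    return z.reshape(hn * n, wn * n, c)


x = np.arange(4 * 6 * 3).reshape(4, 6, 3)
z = squeeze(x, 2)
print(z.shape, z.size == x.size, np.array_equal(unsqueeze(z, 2), x))
```

The number of elements is unchanged, which is what makes the layer volume-preserving: resolution drops by `n` in each spatial dimension while the channel count grows by `n * n`.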
We further argue that for generative tasks, e.g. video prediction, we can effectively utilize a single two-way autoencoder and use its forward and backward passes as the encoder and the decoder, respectively. The predicted frame $\hat{x}_t$ is thus given by
$$\hat{x}_t = \mathcal{E}^{-1}(\mathcal{P}(\mathcal{E}(x_{t-1}) \mid x_{0:t-2}))$$
where $\mathcal{E}^{-1}$ is the backward pass of $\mathcal{E}$. Our rationale is that such a setting would not only reduce the number of parameters in the model, but also encourage the model to explore the shared feature space between the inputs and the targets. As a result, our method does not require any form of information sharing, e.g. skip connections, between the encoder and decoder. In addition, our two-way autoencoder enjoys a lower computational cost in the multi-frame prediction phase, where the encoding pass is no longer needed and the predictor directly takes its output from the previous timestep as input, since $\mathcal{E}(\mathcal{E}^{-1}(\cdot))$ is an identity mapping.
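The saving in the multi-frame phase can be checked numerically. In the sketch below, `encode` and `decode` are one toy additive coupling block and its inverse, and `predict` is a hypothetical stateless stand-in for the predictor; none of these are the authors' networks. The two rollouts differ only in whether each predicted frame is re-encoded before the next step.

```python
import numpy as np


def f(a):
    """Hypothetical stand-in for a convolution-and-activation stack."""
    return np.tanh(a)


def encode(x):
    x1, x2 = np.split(x, 2, axis=-1)
    y2 = x2 + f(x1)
    y1 = x1 + f(y2)
    return np.concatenate([y1, y2], axis=-1)


def decode(z):
    """Backward pass of `encode`."""
    y1, y2 = np.split(z, 2, axis=-1)
    x1 = y1 - f(y2)
    x2 = y2 - f(x1)
    return np.concatenate([x1, x2], axis=-1)


def predict(z):
    """Toy predictor (assumption): a fixed affine map on the features."""
    return 0.9 * z + 0.1


def rollout_with_reencoding(z, steps):
    frames = []
    for _ in range(steps):
        z = predict(z)
        frame = decode(z)
        frames.append(frame)
        z = encode(frame)        # redundant: encode(decode(z)) returns z
    return frames


def rollout_in_feature_space(z, steps):
    frames = []
    for _ in range(steps):
        z = predict(z)           # feed the previous output straight back in
        frames.append(decode(z))
    return frames


x0 = np.random.default_rng(0).standard_normal((4, 4, 6))
a = rollout_with_reencoding(encode(x0), 5)
b = rollout_in_feature_space(encode(x0), 5)
print(all(np.allclose(p, q) for p, q in zip(a, b)))
```

Both loops produce the same frames up to rounding, but the second one skips one encoder pass per generated frame.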
2.2 Reversible Predictive Module
In this section, we describe the second part of our video prediction model, the predictor $\mathcal{P}$, which computes dependencies along both the space and time dimensions. Although the traditional stacked-ConvRNN architecture is the most straightforward choice of predictor, we find through experiments that it fails to establish a consistent temporal dependency when equipped with our two-way autoencoder. Therefore, we propose a novel reversible predictive module (RPM), which can be regarded as a recurrent extension of the two-way autoencoder. In the RPM, we substitute all standard convolutions with layers from the ConvRNN family (e.g. ConvLSTM or spatiotemporal LSTM) and introduce a soft attention (weighting gates) mechanism to form a weighted sum of the two groups instead of the direct addition. The main operations of the RPM used in this paper are given as follows:
$$h_t^1 = \mathrm{ConvRNN}(x_t^1, h_{t-1}^1)$$
$$g_t = \phi(W_2 * \mathrm{ReLU}(W_1 * h_t^1 + b_1) + b_2)$$
$$\hat{x}_t^2 = (1 - g_t) \odot x_t^2 + g_t \odot h_t^1$$
where $x_t^1$ and $x_t^2$ denote the two groups of features at timestep $t$, $h_t^1$ denotes the hidden state of the ConvRNN layer, $g_t$ is the weighting gate, $\phi$ is the sigmoid activation, $*$ is the standard convolution operator and $\odot$ is the Hadamard product. The architecture of the reversible predictive module is also shown in Fig 1 b. The RPM adopts an architectural design similar to that of the two-way autoencoder to ensure a pixel-wise alignment between the input and the output, i.e. each position of the features can be traced back to a certain pixel, which makes it compatible with our two-way autoencoder. It also mitigates the vanishing gradient issue across stacked layers since the coupling layer provides a nice property w.r.t. the Jacobian (dinh2014nice). In addition, the attention mechanism in the RPM enables the model to focus on objects in motion instead of the background, which further improves the video prediction quality. Similarly, multiple RPMs alternate between the two groups to form a predictor. We call this predictor conditionally reversible since, given $h_{t-1}$, we are able to reconstruct $x_{t-1}$ from $\hat{x}_t$ if there are no numerical errors:
$$x_{t-1} = \mathcal{E}^{-1}(\mathcal{P}^{-1}(\mathcal{E}(\hat{x}_t) \mid h_{t-1}))$$
where $\mathcal{P}^{-1}$ is the inverse computation of the predictor $\mathcal{P}$. We name the video prediction model that uses the two-way autoencoder as its backbone and RPMs as its predictor CrevNet. Another key factor of the RPM is the choice of ConvRNN. In this paper, we mainly employ ConvLSTM (xingjian2015convolutional) and spatiotemporal LSTM (ST-LSTM, wang2017predrnn).
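The idea of conditional reversibility can be illustrated with one gated half-step of an RPM. Everything below is a simplified sketch under explicit assumptions: `toy_cell` replaces the ConvRNN layer with an elementwise recurrence, the gate uses 1x1 convolutions written as matrix products over the channel axis, and all names are hypothetical. The inverse is derived by solving the weighted sum for the overwritten group, which is only possible when the previous hidden state is known.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def toy_cell(x1, h_prev):
    """Hypothetical stand-in for a ConvRNN layer (e.g. ConvLSTM)."""
    return np.tanh(0.5 * x1 + 0.5 * h_prev)


def gate(h1, w1, b1, w2, b2):
    """Soft attention gate; 1x1 convolutions as channel-axis matrix products."""
    hidden = np.maximum(h1 @ w1 + b1, 0.0)
    return sigmoid(hidden @ w2 + b2)


def rpm_forward(x1, x2, h_prev, params):
    h1 = toy_cell(x1, h_prev)
    g = gate(h1, *params)
    y2 = (1.0 - g) * x2 + g * h1     # weighted sum instead of plain addition
    return y2, h1


def rpm_inverse(x1, y2, h_prev, params):
    """Recover x2 from y2; requires the same h_prev as the forward pass."""
    h1 = toy_cell(x1, h_prev)        # x1 is untouched, so h1 and g can be recomputed
    g = gate(h1, *params)
    return (y2 - g * h1) / (1.0 - g) # sigmoid output < 1, so 1 - g > 0


rng = np.random.default_rng(0)
c = 4
x1, x2, h_prev = rng.standard_normal((3, 5, 5, c))
# Small weights keep the gate away from saturation in this toy example.
params = (0.1 * rng.standard_normal((c, c)), np.zeros(c),
          0.1 * rng.standard_normal((c, c)), np.zeros(c))
y2, h1 = rpm_forward(x1, x2, h_prev, params)
print(np.allclose(rpm_inverse(x1, y2, h_prev, params), x2))
```

When the gate saturates near 1, the division by `1 - g` amplifies rounding error, which is one concrete reading of the "if there are no numerical errors" caveat.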
We evaluate our model on a complex real-world dataset, Traffic4cast, which records the traffic status of 3 big cities over a year at 5-minute intervals. By its spatiotemporal nature, traffic forecasting can be straightforwardly framed as a video prediction task. However, this dataset is quite challenging for the following reasons. (1) High resolution: The frame resolution of Traffic4cast is 495 × 436, which is the highest among all datasets. Existing resolution-preserving methods can hardly be adapted to this dataset since they all require extremely large memory and computation. Even if these models could fit on GPUs, they would still not have large enough receptive fields to capture the meaningful dynamics, as vehicles can move up to 100 pixels between consecutive frames. (2) Complicated nonlinear dynamics: Valid data points only reside on the hidden roadmap of each city, which is not explicitly provided in this dataset. Vehicles moving on these curved roads, along with tangled road conditions, produce very complex nonlinear behaviours. The data also involve many unobservable conditions or random events such as weather and car accidents.
Datasets and Setup: Each frame in the Traffic4cast dataset is a 495 × 436 × 3 heatmap, where the last dimension records 3 traffic statuses representing volume, mean speed and major direction at a given location. The general architecture of CrevNet is composed of a 36-layer two-way autoencoder and 8 RPMs. All variants of CrevNet are trained using the Adam optimizer with a starting learning rate of to minimize MSE. We train each model to predict the next 3 frames (the next 15 minutes) from 9 observations and evaluate predictions with the MSE criterion. Moreover, we found two additional tricks, finetuning on each city and finetuning on each timestep, that further improve the overall results.
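The 9-in, 3-out setup and the MSE criterion can be made concrete with a small helper. This is a hedged sketch: the sliding-window sampling and the names `make_windows` and `mse` are assumptions for illustration, not the authors' data pipeline, and the spatial size is shrunk from 495 × 436 to 4 × 4 to keep the example light.

```python
import numpy as np


def make_windows(frames, n_obs=9, n_pred=3):
    """Slice a (T, H, W, C) sequence into (observations, targets) pairs."""
    total = n_obs + n_pred
    pairs = []
    for start in range(frames.shape[0] - total + 1):
        observed = frames[start:start + n_obs]
        target = frames[start + n_obs:start + total]
        pairs.append((observed, target))
    return pairs


def mse(prediction, target):
    """Mean squared error over all pixels, channels and frames."""
    diff = np.asarray(prediction, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.mean(diff ** 2))


# One day at 5-minute intervals is 24 * 60 / 5 = 288 frames.
day = np.zeros((288, 4, 4, 3), dtype=np.float32)
pairs = make_windows(day)
observed, target = pairs[0]
print(len(pairs), observed.shape, target.shape)
```

With 3 predicted frames at 5-minute intervals, each target window covers the 15 minutes that follow the 45 minutes of observations.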
| CrevNet | single model | finetuning on each city | finetuning on each timestep | final model |
Results: The quantitative comparison is provided in Table 1. Unlike all previous state-of-the-art methods, CrevNet does not suffer from high memory consumption, so we were able to train our model on a single V100 GPU. The invertibility of the two-way autoencoder preserves all necessary information for spatiotemporal learning and allows our model to generate sharp and reasonable predictions.