PreCNet: Next Frame Video Prediction Based on Predictive Coding

04/30/2020 · Zdenek Straka, et al. · Czech Technical University in Prague

Predictive coding, currently a highly influential theory in neuroscience, has not yet been widely adopted in machine learning. In this work, we transform the seminal model of Rao and Ballard (1999) into a modern deep learning framework while remaining maximally faithful to the original schema. The resulting network we propose (PreCNet) is tested on a widely used next frame video prediction benchmark, which consists of images from an urban environment recorded from a car-mounted camera. On this benchmark (training: 41k images from the KITTI dataset; testing: Caltech Pedestrian dataset), we achieve, to our knowledge, the best performance to date when measured with the Structural Similarity Index (SSIM). On two other common measures, MSE and PSNR, the model ranked third and fourth, respectively. Performance was further improved when a larger training set (2M images from BDD100k) was employed, pointing to the limitations of the KITTI training set. This work demonstrates that an architecture carefully based on a neuroscience model, without being explicitly tailored to the task at hand, can exhibit unprecedented performance.


1 Introduction

Predicting the near future is a crucial ability that every agent—human, animal, or robot—needs to survive in a dynamic and complex environment. Just to safely cross a busy road, one needs to anticipate the future positions of cars and pedestrians, as well as the consequences of one's own actions. Machines still lag behind in this ability. For deployment in such environments, it is necessary to close this gap and develop efficient methods for foreseeing the future.

One candidate approach for predicting near future is predictive coding—a popular theory from neuroscience. The basic idea is that the brain is a predictive machine which anticipates incoming sensory inputs and only the prediction errors—unpredicted components—are used for the update of an internal representation. In addition, predictive coding tackles another important aspect of perception: how to efficiently encode redundant sensory inputs [huang2011predictive]. Rao and Ballard proposed and implemented a hierarchical architecture [rao1999predictive]—which we will refer to as predictive coding schema (see Section 3.1 for details)—that explains certain important properties of the visual cortex: the presence of oriented edge/bar detectors and extra-classical receptive field effects. This schema has influenced several works on human perception and neural information processing in the brain (see e.g., [spratling2008reconciling, stefanics2014visual, summerfield2009expectation, spratling2010predictive]; for reviews [friston2018does, huang2011predictive, clark2013whatever]).

In this work, our goal was to remain as faithful as possible to the predictive coding schema while casting it into a modern deep learning framework. We thoroughly analyze how the conceptual architecture is preserved. To demonstrate the performance, we chose a widely used benchmark—next frame video prediction—for the following reasons. First, large datasets of unlabeled sequences are available and the task has direct application potential. Second, this task is an instance of unsupervised representation learning, which is currently actively researched (e.g., [mathieu2016deep]). Third, the complexity of the task can be scaled, for example by performing multiple frame prediction (frames are predicted multiple steps ahead). On a popular next frame video prediction benchmark, our—strongly biologically grounded—network achieves state-of-the-art performance. In particular, on the widely used Structural Similarity Index (SSIM) performance metric [wang2004image], the model has achieved, to our knowledge, the best performance to date. In addition to the commonly used training dataset (KITTI), we trained the model on a significantly larger dataset, which improved the performance even further.

This article is structured as follows. The Related Work section overviews models inspired by predictive coding and state-of-the-art methods for video prediction. This is followed by the Architecture section where we describe our model and compare it in detail with the original Rao and Ballard schema [rao1999predictive] and PredNet [lotter2017deep]—a model for next frame video prediction inspired by predictive coding. In Section 4, we detail the datasets, performance metrics, and our experiments in next and multiple frame video prediction. This is followed by Conclusion, Discussion, and Future Work. All code and trained models used in this work are available at [github-precnet].

2 Related Work

This section starts with a summary of predictive coding-inspired machine learning models. This is followed by an overview of state-of-the-art methods for video prediction.

2.1 Predictive coding models

In this section, we will focus on predictive coding-inspired machine learning models. A reader interested in applications in computational and theoretical neuroscience may find the reviews [huang2011predictive, clark2013whatever, spratling2017review] and references [rao1999predictive, stefanics2014visual, summerfield2009expectation, spratling2010predictive, friston2005theory, spratling2008reconciling] useful. Predictive coding, a theory originating in neuroscience, is more a general schema (with certain properties) than a concrete model. Therefore, no single "correct" model of predictive coding is available to date. In this work, by predictive coding we will understand the well-defined schema proposed by Rao and Ballard [rao1999predictive], which was also implemented as a computational model (see Section 3.1 for a description of the schema). This schema, which is highly influential in neuroscience, embodies the crucial ideas of the predictive coding theory.

We will relate predictive coding-inspired machine learning models to the schema by Rao and Ballard and analyze which properties of the original are preserved and which are not. A detailed comparison with our deep neural network—intended to be as faithful as possible to the Rao and Ballard schema—is presented separately in Section 3.3.1. Models with static inputs and models with sequential inputs are presented separately.

2.1.1 Models with static inputs

An important part of predictive coding theory is the existence of prediction error neurons alongside representational neurons (see [rao1999predictive, clark2013whatever]). The models in [spratling2017hierarchical, han2018deep, wen2018deep], intended for object recognition in natural images, have these two distinct neural populations; however, their training is not based on the prediction error minimization used in predictive coding. A generative model by Dora et al. [dora2018deep] for inferring causes underlying visual inputs does not follow the division into error and representational neurons. However, the model is trained, in accordance with predictive coding, to minimize prediction errors. The same authors contributed to a model which extends the predictive coding approach to the inference of latent visuo-tactile representations [struckmeier2019mupnet], used for place recognition by a biomimetic robot in a simulated environment.

2.1.2 Models with sequences as inputs

Ahmadi and Tani proposed the predictive-coding-inspired variational recurrent neural network (PV-RNN) [ahmadi2019novel]. The network works in a three-stage processing cycle: (i) producing predictions, (ii) backpropagating the prediction errors across the network hierarchy, (iii) updating the internal states of the network to minimize future prediction errors. The network was used for synchronous imitation between two robots (joint angles and XYZ coordinates of a hand tip were used) and for extracting latent probabilistic structure from the binary output of a simple probabilistic finite state machine. Using the same three-stage predictive coding processing cycle, Choi and Tani developed a predictive multiple spatio-temporal scales recurrent neural network (P-MSTRNN) [choi2018predictive] for predicting binary image (36x36 pixels) sequences of human whole-body cyclic movement patterns. They also explored how the inferred internal (latent) states can be used for recognition of the movement patterns. Chalasani and Principe proposed a hierarchical linear dynamical model for feature extraction [chalasani2013deep]. The model took inspiration from predictive coding and used higher-level predictions for the inference of lower-level predictions. However, none of these three models uses the division into error and representational neurons, and they consequently follow a different schema than Rao and Ballard [rao1999predictive].

Lotter et al. proposed a predictive neural network (PredNet) for next-frame video prediction [lotter2017deep]. The network follows the division into error and representational neurons, but its processing schema differs from the one proposed by Rao and Ballard [rao1999predictive] and consequently from our model (see Section 3.3.2 for details).

2.2 Video prediction models

Video prediction is an important task in computer vision with a long history. A sequence of images is given and one or multiple following images are predicted (i.e., the next and multiple frame video prediction tasks, respectively). For our work, this provides a use case to benchmark the performance of our new neural network architecture. Therefore, we will restrict ourselves to briefly reviewing recent work with state-of-the-art performance and—wherever feasible—quantitatively compare the performance (see Section 4.3.4).

Most methods for video prediction produce blurred predictions. As blurriness is undesirable, Mathieu et al. [mathieu2016deep] proposed a gradient difference loss function which is minimized when the gradients of the actual and predicted images are the same. This loss function was then combined with adversarial learning. Byeon et al. [byeon2018contextvp] showed with their LSTM-based architecture that directly connecting each predicted pixel with the whole available past context decreased prediction uncertainty at the pixel level and therefore also reduced blurriness. Reda et al. [reda2018sdc] suggested that blurriness is amplified by using datasets that lack large motion and have small resolution. Therefore, they used video games (GTA-V and Battlefield-1) to generate a large high-resolution dataset with sufficiently large motion (testing was performed on natural sequences). The dataset was then used to train a model which combines a kernel-based approach with the use of optical flow. Gao et al. [gao2019disentangling] proposed a model which generated the future frames in two steps. First, a flow predictor was used to warp the non-occluded regions. Then, the occluded regions were in-painted by a separate network. A method by Liu et al. [liu2017video] did not use optical flow directly; instead, a deep network was trained to synthesize a future frame by flowing pixel values from the given video frames. This self-supervised method was also used for interpolation. Similarly to Gao et al. [gao2019disentangling], Hao et al. [hao2018controllable] proposed a two-stage architecture. However, the input of the network additionally contained sparse motion trajectories (automatically extracted for video prediction). First, the network produced a warped image that respected the given motion trajectories. In the second stage, occluded parts of the image were hallucinated and color changes were compensated.

Villegas et al. [villegas2017learning] introduced a model which first performed human pose detection and predicted its future evolution. The predicted human poses were then used to generate future frames. Finn et al. [finn2016unsupervised] proposed a model that, in addition to visual inputs, takes the actions of the robot into account. This action-conditioned model learned to anticipate pixel motions relative to the previous frame.

A Conditionally Reversible Network (CrevNet) proposed by Yu et al. [Yu2020Efficient] uses a bijective two-way autoencoder, based on convolutional networks, for encoding and decoding input frames. Feature maps obtained from the autoencoder are then used as input to a ConvRNN-based predictor. The feature maps transformed by the predictor are then decoded by the autoencoder and output as predicted frames. Besides future frame prediction, the learned features were used for object detection.

Some other state-of-the-art architectures are based on generative adversarial networks (GANs). The GAN by Kwon and Park [kwon2019predicting] can predict both future and past frames. The predictions in both directions are used for training. The GAN proposed by Liang et al. [liang2017dual] is trained to consistently predict future frames and pixel-wise flows using a dual learning mechanism. Vondrick et al. [vondrick2016generating] proposed a GAN which untangles the foreground from the background of the video scene.

Some of the mentioned works [liang2017dual, vondrick2016generating, lotter2017deep, Yu2020Efficient] also demonstrated that the representations learned during next frame video prediction training can be used for supervised learning tasks (e.g., human action recognition).

3 Architecture

This section starts with a description of the predictive coding schema which was proposed by Rao and Ballard [rao1999predictive]. This is followed by a detailed description of our model. The section is closed by a comparison of our model with related models: (i) a hierarchical network for predictive coding proposed by Rao and Ballard, (ii) PredNet – a deep network for next frame video prediction inspired by predictive coding.

3.1 Predictive coding schema

Motivated by crucial properties of the visual cortex, Rao and Ballard proposed a hierarchical predictive coding schema together with its implementation [rao1999predictive]. According to this schema, throughout the hierarchy of visual processing, feedback connections from a higher level to a lower level (e.g., from the secondary visual cortex, V2, to the primary visual cortex, V1) transmit predictions of the activity of the lower area. The error of the prediction is then sent back via the feedforward connections and used to reduce the error in the following moment (see Fig. 1, (b)).

Fig. 1: Comparison of the hierarchical network for predictive coding by Rao and Ballard and our PreCNet. (a) Components of a Predictive Estimator (PE) module of the model by Rao and Ballard: feedforward neurons encoding the synaptic weights, neurons whose responses maintain the current estimate of the input signal, feedback neurons conveying the prediction to the lower level, and error-detecting neurons computing the difference between the current estimate and its top-down prediction from a higher level. (b) General architecture of the hierarchical predictive coding model. At each hierarchical level, feedback pathways carry predictions of neural activity at the lower level, whereas feedforward pathways carry residual errors between the predictions and actual neural activity. These errors are used by the PE at each level to correct its current estimate of the input signal and generate the next prediction. (c) Components of a PE module of the PreCNet architecture (see Section 3.3.1). Figures (a) and (b) redrawn from [rao1999predictive], their captions adopted with minor modifications from [rao1999predictive].

This schema was directly turned into a computational model in [rao1999predictive] (see Fig. 1, (a)). The feedback connection from a higher-level to a lower-level Predictive Estimator (PE) carries the top-down prediction of the lower-level PE activity. The residual error is sent back via feedforward connections to the higher-level PE. The same error, with opposite sign, affects the PE activity in the following moment. The bottom-level PE produces a prediction of the visual input.

Drawing on the predictive coding schema, we propose the Predictive Coding Network (PreCNet) (see Fig. 1, (c)). In contrast with the model by Rao and Ballard (compare parts (a), (c) of Fig. 1), PreCNet uses a modern deep learning framework (see Section 3.2 for details of PreCNet architecture and Section 3.3.2 for a more detailed comparison of both models). This has enabled us to create a model based on the predictive coding schema with state-of-the-art performance, as demonstrated on the next-frame video prediction benchmark.

3.2 Description of PreCNet model (ours)

The structure of the model, the computation of the prediction and states, and the training are detailed below.

3.2.1 Structure of the model

The model, shown in Fig. 2, consists of hierarchically organized modules. (The model is the same as in Fig. 1, (c); however, in order to enable a direct comparison with the Rao and Ballard model, it was redrawn in a different arrangement for Fig. 1. The PE from the model of [rao1999predictive] is not equivalent to the "Module" in Fig. 2; see Fig. 3, (b), (e) for a comparison.) A module consists of the following components (a layer-level sketch in code follows Fig. 2):

  • A representation layer is a convolutional LSTM (convLSTM) layer (see [xingjian2015convolutional, hochreiter1997long]) with an output state (denoted with a lower-case letter following the formalism of [rao1999predictive]; the capital-letter variant was used in [lotter2017deep]). Technically, it consists of two convolutional LSTM layers which share hidden and cell states but differ in their input (one is used during the prediction phase, the other during the correction phase; see Section 3.2.2). The input, forget, and output gates use hard sigmoid as an activation function. During calculation of the final (hidden) and cell states, hyperbolic tangent is used.

  • An error representation consists of Rectified Linear Units (ReLU) whose input is obtained by merging the positive and negative parts of the prediction error. The state of the error representation is denoted E.

  • A decoding layer is a convolutional (conv) layer whose output state constitutes the prediction. It uses ReLU as its activation function.

  • An upsample layer, which uses the nearest-neighbor method, upscales its input by a factor of 2. This layer is not present in module 0.

  • A max-pooling layer downscales its input by a factor of 2. This layer is not present in module 0.

Fig. 2: Modular architecture of PreCNet. The highest module lacks the upward connections (the dashed lines in the figure). The main parts of each module are a representation layer (green), a decoding layer (blue) and error representations (red). See the text and Alg. 1 for more details.
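To make the composition of one module concrete, the sketch below assembles the layers listed above in TensorFlow/Keras (Python), with the channel counts taken from Table I. It is an illustrative sketch, not the released implementation: the function names are ours, and a single ConvLSTM2D stands in for the two convLSTM layers with shared states described above.

import tensorflow as tf
from tensorflow.keras import layers

def make_module(rep_channels, dec_channels, is_bottom=False):
    """Building blocks of one PreCNet module (channel counts per Table I)."""
    block = {
        # Representation layer: convolutional LSTM; gates use hard sigmoid,
        # hidden/cell computation uses tanh (the Keras defaults match this).
        "representation": layers.ConvLSTM2D(
            rep_channels, kernel_size=3, padding="same",
            activation="tanh", recurrent_activation="hard_sigmoid",
            return_state=True),
        # Decoding layer: convolution with ReLU producing the prediction.
        "decode": layers.Conv2D(dec_channels, kernel_size=3,
                                padding="same", activation="relu"),
    }
    if not is_bottom:
        # Up/down-sampling layers are not present in module 0.
        block["upsample"] = layers.UpSampling2D(size=2)   # nearest neighbor
        block["maxpool"] = layers.MaxPooling2D(pool_size=2)
    return block

def error_representation(prediction, target):
    """Merged positive and negative prediction-error populations (ReLU)."""
    pos = tf.nn.relu(prediction - target)
    neg = tf.nn.relu(target - prediction)
    return tf.concat([pos, neg], axis=-1)

# Decoding/representation channel sizes of the three modules as in Table I.
modules = [make_module(rep_channels=60, dec_channels=3, is_bottom=True),
           make_module(rep_channels=120, dec_channels=60),
           make_module(rep_channels=240, dec_channels=120)]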

3.2.2 Computation of the prediction and states

In every time step, PreCNet outputs a prediction of the incoming image. The error of the prediction is then used for the update of the states (see also Fig. 4). The computation in every time step can be divided into two phases:

  1. Prediction phase. The information flow goes iteratively from a higher to a lower module. At the end of this phase (at Module 0), the prediction of the incoming input image is outputted.

  2. Correction phase. In this phase, the information flow goes iteratively up. The error between the prediction and actual input is propagated upward.

In a nutshell, a representation layer represents a prediction of the image (in module 0) or of the pooled convLSTM state of the module below (in the higher modules). The decoding layer transforms the representation into the prediction. The error representation units then depend on the error of the prediction (the difference between the prediction and the image, or the pooled state, respectively). The computation is completely described in Alg. 1; a schematic code outline follows the algorithm.

Algorithm 1: Calculation of the PreCNet states at time t.
Input: the current image, the previous hidden and cell states of the representation layers, the previous error state of the (top) module, and the maximum pixel value.
Prediction phase: iterate top-down through the modules, updating the states of the top module, of each "middle" module, and finally of the bottom module, which outputs the prediction of the incoming image.
Correction phase: iterate bottom-up through the modules, updating the states of the bottom module, of each "middle" module, and finally of the top module.
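The outline below mirrors the two-phase computation of Alg. 1 in Python, with NumPy placeholders standing in for the convLSTM, convolutional, pooling, and upsampling layers. Where the text does not pin down the exact wiring (e.g., which error feeds which convLSTM, or where pooling and upsampling are applied), the flow shown is our assumption, not the released algorithm; the sketch only illustrates the order of operations.

import numpy as np

def conv_lstm(inp, state):
    # Placeholder for the convLSTM representation layer (shared states).
    return np.tanh(state + inp.mean())

def decode(r):
    # Placeholder for the decoding conv layer (ReLU activation).
    return np.maximum(r, 0.0)

def error_units(pred, target):
    # Merged positive/negative prediction-error populations (ReLU).
    return np.concatenate([np.maximum(pred - target, 0.0),
                           np.maximum(target - pred, 0.0)], axis=-1)

def precnet_step(image, R, E_top, n_modules=3):
    """One time step: top-down prediction phase, then bottom-up correction phase."""
    preds = [None] * n_modules
    errors = [None] * n_modules
    # Prediction phase: iterate from the top module down to module 0.
    for l in reversed(range(n_modules)):
        top_input = E_top if l == n_modules - 1 else errors[l + 1]  # assumed wiring
        R[l] = conv_lstm(top_input, R[l])
        preds[l] = decode(R[l])                 # prediction for the level below
        target = image if l == 0 else R[l - 1]  # image at the bottom; pooling omitted above
        errors[l] = error_units(preds[l], target)
    # Correction phase: propagate the prediction errors from module 0 upwards.
    for l in range(1, n_modules):
        R[l] = conv_lstm(errors[l - 1], R[l])   # assumed wiring of the second convLSTM
    return preds[0], R, errors                  # preds[0] is the predicted image

# Toy usage with arbitrary shapes, just to exercise the control flow.
R = [np.zeros((8, 8)) for _ in range(3)]
prediction, R, errors = precnet_step(np.random.rand(8, 8), R, np.zeros((8, 8)))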

3.2.3 Training of the model

The model is trained by minimizing weighted prediction errors through time and hierarchy [lotter2017deep]. The loss function is defined as

L = \frac{1}{S} \sum_{s=1}^{S} L_s,   (1)

L_s = \sum_{t=1}^{T} \mu_t \sum_{l=0}^{N-1} \frac{\lambda_l}{K_l} \sum_{k=1}^{K_l} E_{l,k}^{t},   (2)

where L_s is the loss of sequence s, E_{l,k}^{t} is the error of the k-th unit in the l-th module at time t, S is the number of image sequences, T is the length of a sequence, N is the number of modules, \mu_t and \lambda_l are the time and module weighting factors, and K_l is the number of error units in the l-th module. Mini-batch gradient descent was used for the minimization.
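The weighting in Eqs. (1)-(2) can be written out in a few lines. The sketch below is a plain NumPy illustration with our own variable names (the error activations are assumed to be already collected per time step and module); it is not the training code from the repository.

import numpy as np

def sequence_loss(errors, mu, lam):
    """errors[t][l]: flattened error-unit activations of module l at time t."""
    total = 0.0
    for t, per_module in enumerate(errors):
        for l, e in enumerate(per_module):
            total += mu[t] * lam[l] * np.mean(e)   # mean = sum over units / K_l
    return total

def training_loss(error_sequences, mu, lam):
    # Eq. (1): average of the per-sequence losses, Eq. (2), over all sequences.
    return float(np.mean([sequence_loss(seq, mu, lam) for seq in error_sequences]))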

3.3 Comparison of PreCNet with other models

We will compare our model with the predictive coding schema [rao1999predictive] and PredNet [lotter2017deep].

3.3.1 Comparison of PreCNet and Rao and Ballard model

PreCNet uses the same schema as the model by Rao and Ballard (see Fig. 1 and Section 3.1). However, as PreCNet is couched in a modern deep learning framework, there are inevitably some differences:

  • Intensity of interaction between the Predictive Estimators (PEs). Each PE of PreCNet is updated only twice during one time step (one input image): once during the Prediction (top-down) phase and once during the Correction (bottom-up) phase. This means that each PE interacts with its neighbors only twice per time step. By contrast, the PEs of the model by Rao and Ballard interact with each other many times (until their representation states converge) during one time step. As PreCNet uses the deep learning approach, which is computationally more demanding, such intensive interaction between the PEs is not feasible.

  • Dynamic vs. static inputs. In contrast with PreCNet, which takes image sequences as inputs, the model by Rao and Ballard takes static images as inputs. An extension to next frame video prediction should be possible [rao1997dynamic] (by using a recurrent transformation of the representation layer states, in which the prediction of the next state is a nonlinear function of the current states weighted by synaptic recurrent weights), but it has not been completely demonstrated (in [rao1999optimal], a model with only one level of hierarchy is employed). These recurrent connections resemble the recurrent connections inside the PreCNet representation (convLSTM) layer.

  • Different building blocks. Representation layer states of the Rao and Ballard model are determined by a first-order differential equation. Thus, the states are, unlike PreCNet, updated until they converge.

    The error representation of PreCNet consists of merged positive and negative error populations [lotter2017deep]. These two populations are also used in the model of Rao and Ballard; however, they are not merged and are used separately.

  • One vs. multiple PEs on one level. There are multiple PEs in each level of the model by Rao and Ballard. Higher-level PEs progressively operate on bigger spatial areas than the lower-level PEs. PreCNet has one PE in each level of the hierarchy.

  • Different update of representation states. To update the representation states of the model by Rao and Ballard, the difference between the prediction of the PE and the actual input (bottom-up error in Fig. 1, (a)) and the difference between the actual state of the PE and the one predicted by the higher PE (top-down error in Fig. 1, (a)) are used simultaneously. PreCNet also uses both differences for the computation of the new representation states, however, not simultaneously; one difference is used by one of the convLSTM layers during the prediction phase, the second by the other convLSTM layer during the correction phase (notice that the two convLSTM layers share cell and hidden unit states).

  • Minimizing error at all levels vs. only the bottom-level error. Errors at all levels of the model by Rao and Ballard are minimized. However, PreCNet achieved better results when only the bottom-level error—the difference between the predicted and the actual image—was minimized (see the setting of the module weighting factors in Section 4.3.2).

3.3.2 Comparison of PreCNet and PredNet

PredNet, a state-of-the-art deep network for next frame video prediction [lotter2017deep], is also inspired by the model by Rao and Ballard. PredNet and PreCNet (which we propose) are similar in these aspects:

  • Building blocks: error representations, convolutional, and convolutional LSTM networks.

  • Training procedure. For the next frame video prediction task, most training parameters of PreCNet, such as input sequence length and batch size, were taken from PredNet. (The motivation was two-fold. Firstly, we wanted to make it clear that the significant improvement of PreCNet over PredNet is not caused by a better choice of training parameters. Secondly, the few trials with other parameter values that we performed did not lead to significantly better results.)

However, there are two crucial properties in which PredNet departs from the predictive coding schema (see Fig. 3, (a), (c), (d)):

  • According to the predictive coding schema, each PE except for the bottom one outputs a prediction of the activity (representation layer state) of the PE one level below (see Section 3.1); PredNet does not.

  • In the schema, there is no direct connection between the activities of two neighboring PEs (the representation layer states in the formalism of [lotter2017deep]); in PredNet, the representation layer state from above feeds directly into the ConvLSTM of the level below.

Instead, to remain faithful to the predictive coding schema, the building blocks of PreCNet were connected in a significantly different way (see Fig. 3 for comparison).

Fig. 3: Comparison of PredNet and PreCNet. In (a), (b), the differences (connections between the blocks, some building blocks) are highlighted. In (c), (d), (e), the Predictive Estimators (PEs) of PredNet, PreCNet and the model by Rao and Ballard are compared. Notice that the input from above in (d), (e)—the prediction of the representation state—is compared with that representation state, and the error is used for its update. The corresponding upper input (blue) of PredNet is a different entity; it is not related to the representation state below and is also compared with a different entity. There is also one more input from above—the representation layer state from above—which goes directly into the ConvLSTM block of PredNet. PreCNet (see (e)) has overcome these differences and follows the same predictive coding schema as the model by Rao and Ballard. Notice the correspondence of the olive and purple polygons ((a), (b)) and the PEs of PredNet and PreCNet (the rectangles in (c), (e)). In order to enhance comprehensibility, some of the labels from (a), (b) were added to (c), (d), (e) and vice versa. See Supplementary materials – Schema transformation to check the correspondence between both ways of visualization ((a), (b) and (c), (e)).

These modifications have led to considerably better performance of PreCNet in comparison with PredNet (see Section 4.3.4).

4 Experiments

In this section, the datasets and performance measures are introduced, followed by experiments on next frame and multiple frame video prediction. Trained models and code needed for replication of all the results presented in the paper (dataset preprocessing, model training and evaluation) are available on a GitHub repository [github-precnet].

4.1 Datasets

All datasets used consist of visual sequences obtained from a car-mounted camera. The scenes include fast movements of complex objects (e.g., cars, pedestrians), new objects coming unexpectedly into the scene, as well as movement of the urban background.

For training, we used two different datasets: KITTI [Geiger2013IJRR] and BDD100K [yu2019bdd100k]. For evaluation, we used the Caltech Pedestrian Dataset [CVPR09peds, Dollar2012PAMI], employing Piotr's Computer Vision Matlab Toolbox [PMT] during preprocessing. Using the Caltech Pedestrian Dataset for evaluation enables a direct comparison of the models from both training variants.

  • KITTI dataset and its preprocessing: We followed the preprocessing procedure from [lotter2017deep]. The frames were center-cropped and resized with the bicubic method (we do not know which resizing method was originally used by Lotter et al. [lotter2017deep]) to 128 by 160 pixels (see the repository for code; a preprocessing sketch is also given after this list). We also followed the division of the "city", "residential" and "road" categories of the KITTI dataset into training (57 recording sessions, approx. 41K frames) and validation parts in the same way as in [lotter2017deep]. The dataset has a 10 fps frame rate.

  • Caltech Pedestrian Dataset and its preprocessing: Frames were preprocessed in the same way as the frames of the KITTI dataset (see above). Videos were downsampled from 30 fps to 10 fps (every 3rd frame was taken). As this dataset was used only for evaluation of the performance, only the testing parts (set06-set10) were used (approx. 41K frames).

  • BDD100K and its preprocessing: The preprocessing of the dataset was analogous to the preprocessing of the Caltech Pedestrian Dataset, including reducing the frame rate from 30 to 10 fps. As the size of the whole dataset is very large (roughly 40M frames at 10 fps), we randomly chose training and validation subsets of the dataset—see the repository for details and the chosen videos. We created two variants of the training dataset: a big one with roughly 2M frames (5000 recording sessions) and a small one of similar size to the KITTI training dataset (approx. 41K frames, 105 recording sessions). As a validation dataset, we randomly selected a subset of the validation part of BDD100K with approx. 9K frames.
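The sketch below illustrates the shared preprocessing steps (center crop, bicubic resize to 128 by 160 pixels, scaling to [0, 1], and 30 to 10 fps downsampling) in Python. It is an approximation for illustration only; the exact cropping in the released preprocessing code may differ.

import numpy as np
from PIL import Image

TARGET_H, TARGET_W = 128, 160

def preprocess_frame(frame):
    """Center-crop to the target aspect ratio, then bicubic-resize to 128x160."""
    h, w = frame.shape[:2]
    target_ratio = TARGET_W / TARGET_H
    if w / h > target_ratio:                      # too wide: crop the width
        new_w = int(h * target_ratio)
        x0 = (w - new_w) // 2
        frame = frame[:, x0:x0 + new_w]
    else:                                         # too tall: crop the height
        new_h = int(w / target_ratio)
        y0 = (h - new_h) // 2
        frame = frame[y0:y0 + new_h, :]
    img = Image.fromarray(frame).resize((TARGET_W, TARGET_H), Image.BICUBIC)
    return np.asarray(img, dtype=np.float32) / 255.0   # pixel values in [0, 1]

def downsample_fps(frames, step=3):
    """Keep every `step`-th frame (30 fps -> 10 fps for step=3)."""
    return frames[::step]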

4.2 Performance measures

For comparison of a predicted frame with the actual frame, we use standard measures: Mean Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and the Structural Similarity Index (SSIM) [wang2004image]. MSE is a simple measure whose low values indicate high similarity between frames. PSNR is related to MSE; its value should be as high as possible. A significant limitation of these two is that their evaluation of the similarity between two images does not correlate very well with human judgment (e.g., [wang2002image, winkler1999perceptual]). SSIM was created to be more correlated with human perception. SSIM values are bounded to [-1, 1], and a higher value signifies higher similarity.
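For concreteness, the three measures can be computed on [0, 1]-scaled frames as sketched below (Python, using scikit-image for SSIM). This is only an illustration; reported numbers in the literature may use slightly different settings (e.g., the SSIM window).

import numpy as np
from skimage.metrics import structural_similarity

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, max_val=1.0):
    return float(10.0 * np.log10(max_val ** 2 / mse(pred, target)))

def ssim(pred, target):
    # channel_axis=-1 for color frames (multichannel=True in older scikit-image)
    return float(structural_similarity(pred, target, channel_axis=-1, data_range=1.0))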

4.3 Next frame video prediction

First, the experimental settings and parameters are described. This is followed by quantitative results and a qualitative analysis. The PreCNet results presented in this subsection can be reproduced with the publicly available code [github-precnet].

4.3.1 Experimental settings

We performed experiments with two settings. In both, the performance of the trained models was measured on the Caltech Pedestrian Dataset (see Section 4.1), which is commonly used for evaluating the next frame video prediction task. This also enabled a direct comparison of training on both datasets. The training was done on:

  • KITTI dataset. This setting (i.e., KITTI for training, Caltech Pedestrian Dataset for evaluation) is popular for evaluation of the next frame video prediction task and enables a good comparison with other state-of-the-art methods.

  • BDD100K dataset. A randomly chosen subset of the dataset (approx. 2M frames) was used. This training dataset is significantly larger than the KITTI dataset, which helps to avoid overfitting. We also performed training on a smaller BDD100K subset of roughly the same size as the KITTI training set.

4.3.2 Network parameters

The main parameters of the network are summarized in Table I. For choosing a suitable number of hierarchical modules, layer sizes (number of channels), and module weighting factors, the KITTI dataset was used for training. We performed a manual heuristic parameter search to minimize the mean absolute error (between the predicted and actual frames) on the validation set. (If the module weighting factors of all higher modules are zero, the mean absolute error between the predicted and actual frames corresponds to twice the loss value (2); this is a consequence of the division of the error representation into negative and positive parts and the use of ReLU. For non-zero higher module weights, this does not hold.) Padding was used to preserve the size in all convolutional layers (including convLSTM). Pixel values of the input frames were divided by 255 to bring them into the range [0, 1]. The filter sizes were taken from [lotter2017deep] (for an explanation of this choice, see Section 4.3.3).

module | module weight | decoding (conv) #chan. | filter size | representation (convLSTM) #chan. | filter size
i=0 | 1 | 3 | 3 | 60 | 3
i=1 | 0 | 60 | 3 | 120 | 3
i=2 | 0 | 120 | 3 | 240 | 3
TABLE I: Network parameters summary. The parameters of each module in the hierarchy are described in one row. Module weights are in the second column. The following columns contain the number of channels (chan.)/layer size and the filter sizes of the decoding (conv) and representation (convLSTM) layers. For a detailed explanation see Section 3.2.

4.3.3 Training parameters

Except for the training length and learning rate, all values of the training parameters were the same as in [lotter2017deep] (see Section 3.3.2 for the explanation). The network was trained on input sequences of fixed length. During learning, the error related to the first predicted frame is ignored (its time weighting factor is set to zero), since the first prediction is produced before seeing any input frame. Prediction errors related to the following time steps are weighted equally.

In each epoch, 500 sequences from the training set were randomly selected to form batches of size 4 and used for weight updates. For validation, 100 randomly selected sequences from the validation set were used in each epoch. We used Adam [adam2015] as the optimization method for gradient descent on the training loss (1). The Adam parameters were set to their default values (β1 = 0.9, β2 = 0.999).

Training parameters for both datasets were very similar except for the number of training epochs and the learning rate schedule. For the KITTI and BDD100K training, the learning consisted of 1000 and 10000 epochs, respectively. The learning rate was held at its initial value for the first 900 and 9900 epochs, respectively, and was then decreased for the last epochs. (For BDD100K training, the initial learning rate setting led in two of four cases to a rapid increase of the training loss in later stages of training; therefore, the learning rate was lowered.) As the BDD100K training set is significantly larger than the KITTI training set, the training was longer for BDD100K. The choice of the training length and learning rate was based on the evolution of the validation loss and on limited computational resources: the validation loss still decreased slightly at the final epochs, but the benefit was not significant enough to justify continuing the training with the (limited) computational resources.

4.3.4 Quantitative results

For a quantitative analysis of the performance of the model, we used standard procedure and measures for evaluating next frame video prediction. The network obtained a sequence (from Caltech Pedestrian Dataset) of length 10 and then predicted the next frame (see Fig. 4 for details).

Fig. 4: Next frame video prediction evaluation schema. In each time step, PreCNet outputs a prediction of the next frame. The prediction error is used to update the network states. After inputting 10 frames, the predicted frame is compared—using MSE, PSNR, SSIM—with the actual input. This schema was used for the quantitative and qualitative analysis of next frame video prediction (see Section 4.3).

This frame is compared to the actual frame using MSE, PSNR and SSIM (see Section 4.2). The overall value of each measure is then obtained as the mean of the values calculated over all predicted frames.
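A minimal sketch of this evaluation loop is given below. model.predict_next is a hypothetical stand-in for the trained network's interface (not the repository's actual API), and mse, psnr and ssim are the measure functions sketched in Section 4.2.

import numpy as np

def evaluate_next_frame(model, sequences):
    """sequences: iterable of arrays of shape (11, H, W, C) with pixels in [0, 1]."""
    scores = {"mse": [], "psnr": [], "ssim": []}
    for seq in sequences:
        context, target = seq[:10], seq[10]
        pred = model.predict_next(context)      # prediction after 10 input frames
        scores["mse"].append(mse(pred, target))
        scores["psnr"].append(psnr(pred, target))
        scores["ssim"].append(ssim(pred, target))
    # The overall value of each measure is the mean over all predicted frames.
    return {k: float(np.mean(v)) for k, v in scores.items()}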

We performed 10 training repetitions on KITTI dataset (see Section 4.3.1). The results are summarized in Table II. The results show that the learning is stable.

MSE PSNR SSIM
best value 0.00205 28.4 0.929
worst value 0.00220 28.1 0.928
median 0.00208 28.4 0.928
TABLE II: Performance summary of 10 training repetitions on KITTI dataset. Caltech Pedestrian Dataset was used for calculation of the values. See Section 4.3.4 for details.

We took the best model of the 10 repetitions (according to SSIM) and compared it with state-of-the-art methods (see Table III).

Caltech Pedestrian Dataset
Method MSE PSNR SSIM
Copy last frame 0.00795 23.2 0.779
BeyondMSE [mathieu2016deep] 0.00326 - 0.881
DVF [liu2017video] - 26.2 0.897
DM-GAN [liang2017dual] 0.00241 - 0.899
CtrlGen [hao2018controllable] - 26.5 0.900
PredNet [lotter2017deep] 0.00242 27.6 0.905
RC-GAN [kwon2019predicting] 0.00161 29.2 0.919
ContextVP [byeon2018contextvp] 0.00194 28.7 0.921
DPG [gao2019disentangling] - 28.2 0.923
CrevNet [Yu2020Efficient] - 29.3 0.925
PreCNet (ours) 0.00205 28.4 0.929
TABLE III: Next frame video prediction performance on Caltech Pedestrian Dataset after training on KITTI dataset. The methods are sorted according to SSIM values. If not stated otherwise, a network got ten input images and predicted the next one which was used during performance evaluation. Unless otherwise stated, the values were taken from the original articles. Values for BeyondMSE were taken from [liang2017dual], values for DVF and CtrlGen were taken from [gao2019disentangling]. Values for PredNet were taken from [byeon2018contextvp], because in [lotter2017deep] the values were averaged over nine (2-10) time steps. Values for RC-GAN were calculated after only four input images (not ten), however, the network had better performance in this case than for input sequence of length ten.

PreCNet outperformed all methods in SSIM. In MSE and PSNR, it was outperformed by two and three other methods, respectively.

As training on the BDD100K dataset required long training (large dataset), we performed only two training repetitions. The performance is evaluated in Table IV. (The performance of the network from the other training repetition is: MSE 0.00169, PSNR 29.3, SSIM 0.938.)

Caltech Pedestrian Dataset
Training Set #frames #epochs MSE PSNR SSIM
BDD100K 2M 10000 0.00167 29.4 0.938
BDD100K 41K 1000 0.00201 28.6 0.926
KITTI 41K 1000 0.00205 28.4 0.929
TABLE IV: Comparison of PreCNet performance on Caltech Pedestrian Dataset after training on KITTI (same as in Table III) and BDD100K dataset (see Section 4.3.1 for details). Training on BDD100K with 41K frames was performed in order to better compare learning on both datasets. Therefore, all training parameters for BDD100K with 41K frames and KITTI were identical.

Using the larger dataset led to a significant performance improvement in all three measures. Comparing PreCNet trained on the large BDD100K subset (2M) with the models trained on the KITTI dataset (see Table III), our model achieved the best value also for PSNR. However, RC-GAN trained on the original dataset (KITTI) still slightly outperformed, in MSE (0.00161 vs. 0.00167), PreCNet trained on the much larger dataset.

In order to evaluate the effect of the different properties of the BDD100K and KITTI datasets on performance, we created a small version of the BDD100K dataset with only approx. 41K frames (similar in size to KITTI) and used the same training parameters as for training on KITTI. The performance on this dataset was quite similar to the performance on KITTI. (We performed 3 training repetitions on BDD100K with 41K frames. Table IV shows the performance of the best one according to SSIM. The performance of the other two is MSE {0.00199; 0.00202}, SSIM {0.925; 0.926}, PSNR {28.6; 28.6}.) This suggests that the "quality" of the training set (BDD100K vs. KITTI) is not the key factor for obtaining better performance in this case. We studied the effect of the number of training epochs as well. The validation loss on the small subset of BDD100K (41K frames) started to increase during training (1K epochs), indicating overfitting. Thus, we can exclude the possibility that training for 10K epochs would further improve performance. Hence, we claim that it is really the dataset size that enabled the better results obtained with BDD100K (2M frames, 10K epochs).

4.3.5 Qualitative analysis

Fig. 5 provides a qualitative comparison of PreCNet with other state-of-the-art methods trained on the KITTI dataset (see Table III). The predicted frames used for this analysis were obtained in the same way as for the quantitative analysis (see the predicted frame at the last time step in Fig. 4).

Fig. 5: Qualitative comparison of PreCNet with other state-of-the-art methods on the Caltech Pedestrian Dataset. All models were trained on the KITTI dataset. Ten input frames were given and the next one was predicted by the models (RC-GAN used only four input frames – see the explanation in Table III; for references see Table III). The images of the predictions of the other models are copied from the original or other cited papers (see references in Table III). Position of the sequences in the Caltech Pedestrian Dataset by rows: set07-v011, set10-v010, set10-v010, set06-v009, set10-v009.

To assess which of the methods is best through visual inspection is not straightforward; none of the models is better than the others in all aspects and shown frames (excluding PredNet which produced significantly worse predictions). For example, in the fourth row of Fig. 5, DPG has generally the sharpest prediction but PreCNet predicted the street lamp significantly better.

In Fig. 6, KITTI and BDD100K (both 2M and 41K) training variants (see Table IV) are compared. Usage of large BDD100K dataset (with approx. 2M frames) for training led to significant improvement of all the measures (see Table IV) in comparison with training on KITTI dataset.

Fig. 6: Qualitative comparison of PreCNet performance on the Caltech Pedestrian Dataset after different training variants. The first row shows the last frame of the input sequence of length 10. The second row shows the ground truth frame. The following rows show the predicted frames of the models from the quantitative evaluation in Table IV. Position of the sequences in the Caltech Pedestrian Dataset by columns: set10-v010, set06-v001, set07-v011, set07-v011. In contrast with Fig. 5, the meaning of the horizontal and vertical arrangement is inverted. To see the whole input sequences and the related predictions, see Supplementary materials – Examples of next frame video prediction sequences.

This also manifested in the visual quality of the predictions of fast-moving cars, as can be seen in the second and third columns of the figure. The phantom parts of the predicted cars were reduced. It also led to better shapes of the predicted cars, as can be seen in the prediction in the first column (focus on the front part of the van). On the other hand, in some cases training on the BDD100K dataset led to blurrier predictions than training on KITTI (see the last column).

4.4 Multiple frame prediction

For multiple frame prediction, we used the same trained models as for next frame video prediction (see Section 4.3). The network had access to the first 10 frames—the same as in next frame video prediction. Then, in each timestep, the network produced the next frame and this predicted frame was used as the actual input (as illustrated in Fig. 7).

Fig. 7: Multiple frame video prediction evaluation schema. After inputting 10 frames, the predicted frames are inputted instead of the actual frames. The prediction errors are therefore zeros. The predicted frames are compared—using MSE, PSNR, SSIM—with the actual inputs. We used this schema for both quantitative and qualitative analysis of Multiple frame prediction (see Section 4.4).

Therefore, the prediction error between the prediction and input frame was zero.

We did not use fine-tuning for multiple frame predictions (compare with fine-tuning of PredNet for multiple frame prediction [lotter2017deep]). We preferred to follow the proposed learning mechanism (see Section 3.2.2), which is based on minimizing prediction error only one step ahead, to using another learning mechanism.

Please note the different meaning of the timestep labels t and T: lower-case t starts at the beginning of a sequence, in contrast with capital T, which starts at the beginning of the predicted sequence (see the timestep labels in Fig. 7). The code needed for generating the presented results is publicly available [github-precnet].
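The closed-loop protocol can be summarized in a few lines of Python. In the sketch below, model.reset_state and model.step are hypothetical single-step interfaces (not the repository's actual API), used only to illustrate feeding predictions back as inputs.

def predict_multiple_frames(model, context_frames, n_future=15):
    """Warm up on real frames, then feed each prediction back as the next input."""
    state = model.reset_state()
    for frame in context_frames:                 # the 10 real context frames
        prediction, state = model.step(frame, state)
    # `prediction` is now the frame predicted after the last real input (T = 1).
    future = []
    for _ in range(n_future - 1):
        future.append(prediction)
        prediction, state = model.step(prediction, state)   # closed loop
    future.append(prediction)
    return future                                # predictions for T = 1 .. n_future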

4.4.1 Quantitative results

Table V provides a quantitative comparison of PreCNet, PredNet, CrevNet and RC-GAN for multiple frame prediction. For SSIM, PreCNet trained on KITTI outperformed PredNet until timestep T = 9, when the values became equal, and then PreCNet started to lose. For PSNR, PreCNet started to lose earlier (T = 6). RC-GAN and CrevNet outperformed PreCNet in nearly all timesteps for SSIM (at T = 1, SSIM for PreCNet was 0.929 and for CrevNet 0.925; see Table III), and RC-GAN also outperformed it in all timesteps for PSNR.

Method T=1 3 6 9 12 15
PredNet [lotter2017deep] PSNR 27.6 21.7 20.3 19.1 18.3 17.5
SSIM 0.90 0.72 0.66 0.61 0.58 0.54
RC-GAN [kwon2019predicting] PSNR 29.2 25.9 22.3 20.5 19.3 18.4
SSIM 0.91 0.83 0.73 0.67 0.63 0.60
CrevNet [Yu2020Efficient] SSIM 0.93 0.84 0.76 0.70 0.65 -
PreCNet PSNR 28.5 23.4 20.2 18.4 17.2 16.3
(KITTI) SSIM 0.93 0.82 0.69 0.61 0.56 0.53
PreCNet PSNR 29.5 24.6 21.4 19.4 18.3 17.4
(BDD100K 2M) SSIM 0.94 0.85 0.73 0.65 0.59 0.56
TABLE V: A quantitative comparison of selected methods for multiple frame prediction. The methods obtained sequences of fixed length (10 for PredNet, CrevNet and PreCNet, 4 for RC-GAN; see the caption of Table III for the explanation) from the Caltech Pedestrian Dataset and output predictions 15 steps ahead (CrevNet only 12). CrevNet, RC-GAN, PredNet and PreCNet (KITTI) were trained on KITTI. PreCNet was also trained on a subset of BDD100K with 2M frames; this should be kept in mind when comparing it with the other four trained models. Values for PredNet and RC-GAN were copied from [kwon2019predicting]. Values for CrevNet were taken from [Yu2020Efficient].

We also added PreCNet trained on the large subset of BDD100K to the comparison. This PreCNet outperformed PredNet in all timesteps for SSIM and in most timesteps for PSNR; at timestep T = 15 this reversed. However, CrevNet and RC-GAN still outperformed PreCNet in most timesteps. For SSIM, PreCNet had better results than CrevNet and RC-GAN only for the earliest predicted frames (T = 1 and T = 3 in Table V). For PSNR, RC-GAN was outperformed by PreCNet only for T = 1.

In summary, PreCNet started with mostly better predictions than its competitors, however, its performance tended to degrade faster for prediction further ahead.

4.4.2 Qualitative analysis

The methods were compared using the sequences used in [kwon2019predicting]. Fig. 8 provides one example (for another illustration, see Supplementary Materials – Multiple frame video prediction sequence).

Fig. 8: A qualitative comparison of selected methods for multiple frame prediction. The methods obtained a sequence of fixed length (10 for PredNet, CrevNet and PreCNet, 4 for RC-GAN; see the caption of Table III for the explanation) from the Caltech Pedestrian Dataset and output predictions 15 steps ahead. RC-GAN, PredNet, CrevNet and PreCNet (KITTI) were trained on KITTI. PreCNet was also trained on a subset of BDD100K with 2M frames; this should be kept in mind when comparing it with the other four trained models. This figure was obtained from the figure in [kwon2019predicting] by adding the sequences for PreCNet and CrevNet (taken from [Yu2020Efficient]). The location of the sequence in the Caltech Pedestrian Dataset is set10-v009. Another qualitative comparison (without CrevNet), with a different sequence, is in Supplementary materials – Multiple frame video prediction sequence.

Predictions by PreCNet appear less blurred than those by PredNet. This is especially apparent for the later predicted frames. Compared to RC-GAN, the frames predicted by PreCNet trained on KITTI seem to have more natural colors and the background is mostly less blurred (focus on the buildings in the background). PreCNet trained on the large subset of BDD100K (2M frames) produced even less blurred frames. Comparison with CrevNet is not straightforward. For example, CrevNet captured the geometry of the shadow of the building on the road better than PreCNet. On the other hand, it produced a phantom object (see the right side of the road in the later timesteps) which is not present (or is negligible) in the corresponding frames by PreCNet.

5 Conclusion, Discussion, Future Work

In this work, the seminal predictive coding model of Rao and Ballard [rao1999predictive]—here referred to as predictive coding schema—has been cast into a modern deep learning framework, while remaining as faithful as possible to the original schema. The similarities and differences are elaborated in detail. We also claim and explain that the network we propose (PreCNet) is more congruent with [rao1999predictive] than other machine learning models that take inspiration from predictive coding only; the case of PredNet [lotter2017deep] is studied explicitly.

PreCNet was tested on a widely used next frame video prediction benchmark, which consists of images from an urban environment recorded from a car-mounted camera. On this benchmark (training: 41k images from the KITTI dataset; testing: Caltech Pedestrian dataset), we achieved, to our knowledge, the best performance to date when measured with the Structural Similarity Index (SSIM)—a performance measure that should correlate best with human perception. On two other common measures—MSE and PSNR—the model ranked third and fourth, respectively. Performance on all three measures was further improved (first rank also in PSNR) when a larger training set (2M images from BDD100k; to our knowledge, the biggest dataset ever used in this context) was employed. This may suggest that the current practice of training on the rather small KITTI dataset may be limiting in the long run. At the same time, the task itself seems highly relevant, as a virtually unlimited amount of data is readily available without any need for labeling.

In multiple frame video prediction, qualitatively, the frames predicted by PreCNet look reasonable and in some aspects better than some of the competitors. However, a quantitative comparison reveals that PreCNet performance degrades slightly faster than that of its competitors when predicting up to 15 frames ahead. This remains to be further analyzed in the future.

In the future, we plan to analyze the representations formed by the proposed network. It would be interesting to study how much of the semantics of the urban scene the network has "understood" and how that is encoded. For example, our network has not quite figured out that every car has a finite length and that its end should be predicted at some point when it is not occluded anymore. In our model, the best results on the task were achieved when only the prediction error on the bottom level—the difference between the actual frame and the predicted one—was minimized during learning. Rao and Ballard [rao1999predictive], on the other hand, minimized this error on every level of the network hierarchy, which may have an impact on the representations formed. Testing on a different task, like human action recognition (e.g., [liang2017dual, vondrick2016generating, lotter2017deep]), is also a possibility. Finally, some datasets also feature other signals apart from the video stream. Adding inertial sensor signals or the car's steering wheel angle or throttle level is another avenue for future research.

We want to close with a discussion of the implications of our model for neuroscience. Casting the predictive coding schema into a deep learning framework has led to unprecedented performance on a contemporary task, without being explicitly designed for it. In the future, we plan to analyze the consequences for computational neuroscience. While receptive field properties in sensory cortices remain an active research area (e.g., [singer2018sensory]), a question remains whether the deep learning approach can lead to a better model than, for example, that of Rao and Ballard [rao1999predictive]. Richards et al. [richards2019deep] and Lindsay [lindsay2020convolutional] provide recent surveys of this perspective.

Acknowledgments

We would like to thank the authors of PredNet [lotter2017deep] for making their source code public, which significantly accelerated the development of PreCNet.

References