Code and models accompanying "Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning"
The quality of data representation in deep learning methods is directly related to the prior model imposed on the representations; however, generally used fixed priors are not capable of adjusting to the context in the data. To address this issue, we propose deep predictive coding networks, a hierarchical generative model that empirically alters priors on the latent representations in a dynamic and context-sensitive manner. This model captures the temporal dependencies in time-varying signals and uses top-down information to modulate the representation in lower layers. The centerpiece of our model is a novel procedure to infer sparse states of a dynamic model which is used for feature extraction. We also extend this feature extraction block to introduce a pooling function that captures locally invariant representations. When applied to natural video data, we show that our method is able to learn high-level visual features. We also demonstrate the role of the top-down connections by showing the robustness of the proposed model to structured noise.
The performance of machine learning algorithms is dependent on how the data is represented. In most methods, the quality of a data representation is itself dependent on prior knowledge imposed on the representation. Such prior knowledge can be imposed using domain-specific information, as in SIFT, HOG, etc., or in learning representations using fixed priors like sparsity, temporal coherence, etc. The use of fixed priors became particularly popular while training deep networks [5, 6, 7, 8]. In spite of the success of these general-purpose priors, they are not capable of adjusting to the context in the data. On the other hand, there are several advantages to having a model that can “actively” adapt to the context in the data. One way of achieving this is to empirically alter the priors in a dynamic and context-sensitive manner. This will be the main focus of this work, with emphasis on visual perception.
Here we propose a predictive coding framework, where a deep locally-connected generative model uses “top-down” information to empirically alter the priors used in the lower layers to perform “bottom-up” inference. The centerpiece of the proposed model is extracting sparse features from time-varying observations using a linear dynamical model. To this end, we propose a novel procedure to infer sparse states (or features) of a dynamical system. We then extend this feature extraction block to introduce a pooling strategy to learn invariant feature representations from the data. In line with other “deep learning” methods, we use these basic building blocks to construct a hierarchical model using greedy layer-wise unsupervised learning. The hierarchical model is built such that the output from one layer acts as an input to the layer above. In other words, the layers are arranged in a Markov chain such that the states at any layer are only dependent on the representations in the layer below and above, and are independent of the rest of the model. The overall goal of the dynamical system at any layer is to make the best prediction of the representation in the layer below using the top-down information from the layers above and the temporal information from the previous states. Hence, the name deep predictive coding networks (DPCN).
The DPCN proposed here is closely related to models proposed in [9, 10], where predictive coding is used as a statistical model to explain cortical functions in the mammalian brain. Similar to the proposed model, they construct hierarchical generative models that seek to infer the underlying causes of the sensory inputs. While Rao and Ballard use an update rule similar to a Kalman filter for inference, Friston proposed a general framework considering all the higher-order moments in a continuous time dynamic model. However, neither of the models is capable of extracting discriminative information, namely a sparse and invariant representation, from an image sequence that is helpful for high-level tasks like object recognition. Unlike these models, here we propose an efficient inference procedure to extract locally invariant representations from image sequences and progressively extract more abstract information at higher levels in the model.
Other methods used for building deep models, like the restricted Boltzmann machine (RBM), auto-encoders [12, 8] and predictive sparse decomposition, are also related to the model proposed here. All these models are constructed on similar underlying principles: (1) like ours, they also use greedy layer-wise unsupervised learning to construct a hierarchical model and (2) each layer consists of an encoder and a decoder. The key to these models is to learn both encoding and decoding concurrently (with some regularization like sparsity, denoising or weight sharing), while building the deep network as a feed-forward model using only the encoder. The idea is to approximate the latent representation using only the feed-forward encoder, while avoiding the decoder, which typically requires a more expensive inference procedure. However, in DPCN there is no encoder. Instead, DPCN relies on an efficient inference procedure to get a more accurate latent representation. As we will show below, the use of reciprocal top-down and bottom-up connections makes the proposed model more robust to structured noise during recognition and also allows it to perform low-level tasks like image denoising.
To scale to large images, several convolutional models have also been proposed in a similar deep learning paradigm [5, 7, 6]. Inference in these models is applied over an entire image, rather than small parts of the input. DPCN can also be extended to form a convolutional network, but this will not be discussed here.
In this section, we begin with a brief description of the general predictive coding framework and proceed to discuss the details of the architecture used in this work. The basic block of the proposed model that is pervasive across all layers is a generalized state-space model of the form:
$$y_t = F(x_t) + n_t, \qquad x_t = G(x_{t-1}, u_t) + v_t \tag{1}$$
where $y_t$ is the data and $F$ and $G$ are some functions that can be parameterized, say by $\theta$. The terms $u_t$ are called the unknown causes. Since we are usually interested in obtaining abstract information from the observations, the causes are encouraged to have a non-linear relationship with the observations. The hidden states, $x_t$, then “mediate the influence of the cause on the output and endow the system with memory”. The terms $v_t$ and $n_t$ are stochastic and model uncertainty. Several such state-space models can now be stacked, with the output from one acting as an input to the layer above, to form a hierarchy. Such an $L$-layered hierarchical model at any time $t$ can be described as (when $l = 1$, i.e., at the bottom layer, $u_t^{(0)} = y_t$, where $y_t$ is the input data):
$$u_t^{(l-1)} = F(x_t^{(l)}) + n_t^{(l)}, \qquad x_t^{(l)} = G(x_{t-1}^{(l)}, u_t^{(l)}) + v_t^{(l)}, \qquad \forall l \in \{1, 2, \dots, L\} \tag{2}$$
The terms $v_t^{(l)}$ and $n_t^{(l)}$ form stochastic fluctuations at the higher layers and enter each layer independently. In other words, this model forms a Markov chain across the layers, simplifying the inference procedure. Notice how the causes at the lower layer form the “observations” to the layer above: the causes form the link between the layers, and the states link the dynamics over time. The important point in this design is that the higher-level predictions influence the lower levels’ inference. The predictions from a higher layer non-linearly enter into the state space model by empirically altering the prior on the causes. In summary, the top-down connections and the temporal dependencies in the state space influence the latent representation at any layer.
In the following sections, we will first describe a basic computational network, as in (1), with a particular form of the functions $F$ and $G$. Specifically, we will consider a linear dynamical model with sparse states for encoding the inputs and the state transitions, followed by the non-linear pooling function to infer the causes. Next, we will discuss how to stack and learn a hierarchical model using several of these basic networks. Also, we will discuss how to incorporate the top-down information during inference in the hierarchical model.
To begin with, we consider a dynamic network to extract features from a small part of a video sequence. Let $\{y_1, y_2, \dots, y_t, \dots\} \in \mathbb{R}^P$ be a $P$-dimensional sequence of a patch extracted from the same location across all the frames in a video (here $y_t$ is a vectorized form of a $\sqrt{P} \times \sqrt{P}$ square patch extracted from a frame at time $t$). To process this, our network consists of two distinctive parts (see Figure 1(a)): feature extraction (inferring states) and pooling (inferring causes). For the first part, sparse coding is used in conjunction with a linear state space model to map the inputs $y_t$ at time $t$ onto an over-complete dictionary of $K$ filters, $C \in \mathbb{R}^{P \times K}$ $(K > P)$, to get sparse states $x_t \in \mathbb{R}^K$. To keep track of the dynamics in the latent states we use a linear function with state-transition matrix $A \in \mathbb{R}^{K \times K}$. More formally, inference of the features $x_t$ is performed by finding a representation that minimizes the energy function:
$$E_1(x_t, y_t, C, A) = \frac{1}{2}\|y_t - C x_t\|_2^2 + \lambda \|x_t - A x_{t-1}\|_1 + \gamma \|x_t\|_1 \tag{3}$$
Notice that the second term involving the state-transition is also constrained to be sparse to make the state-space representation consistent.
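As a minimal sketch of how this energy can be evaluated, assume it takes the form $E_1 = \frac{1}{2}\|y_t - Cx_t\|_2^2 + \lambda\|x_t - Ax_{t-1}\|_1 + \gamma\|x_t\|_1$. The function name `state_energy`, the toy dimensions and the parameter values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def state_energy(y, x, x_prev, C, A, lam, gamma):
    """Reconstruction error + sparse state-transition error + sparse state prior."""
    recon = 0.5 * np.sum((y - C @ x) ** 2)
    transition = lam * np.sum(np.abs(x - A @ x_prev))
    sparsity = gamma * np.sum(np.abs(x))
    return recon + transition + sparsity

rng = np.random.default_rng(0)
P, K = 4, 8                        # toy sizes: P-dim patch, K-dim over-complete state
C = rng.standard_normal((P, K))    # hypothetical dictionary
A = np.eye(K)                      # identity transition: states are predicted to persist
x_prev = np.zeros(K)
x_prev[2] = 1.0
y = C @ x_prev                     # observation generated exactly by the previous state

# Keeping the previous state costs only the sparsity term; dropping it is penalized
# by both the reconstruction and the transition terms.
e_same = state_energy(y, x_prev, x_prev, C, A, lam=0.5, gamma=0.1)
e_zero = state_energy(y, np.zeros(K), x_prev, C, A, lam=0.5, gamma=0.1)
```

In this toy setting the temporally consistent state has the lower energy, which is the behaviour the transition term is meant to encourage.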
Now, to take advantage of the spatial relationships in a local neighborhood, a small group of states $x_t^{(n)}$, where $n \in \{1, 2, \dots, N\}$ represents a set of contiguous patches w.r.t. the position in the image space, are added (or sum pooled) together. Such pooling of the states may lead to local translation invariance. On top of this, $D$-dimensional causes $u_t \in \mathbb{R}^D$ are inferred from the pooled states to obtain a representation that is invariant to more complex local transformations like rotation, spatial frequency, etc. In line with , this invariant function is learned such that it can capture the dependencies between the components in the pooled states. Specifically, the causes $u_t$ are inferred by minimizing the energy function:
$$E_2(u_t, x_t, B) = \sum_{n=1}^{N} \left( \sum_{k=1}^{K} \left| \gamma_{t,k} \cdot x_{t,k}^{(n)} \right| \right) + \beta \|u_t\|_1, \qquad \gamma_{t,k} = \gamma_0 \left[ \frac{1 + \exp(-[B u_t]_k)}{2} \right] \tag{4}$$
where $\gamma_0 > 0$ is some constant. Notice that here $u_t$ multiplicatively interacts with the accumulated states through $B$, modeling the shape of the sparse prior on the states. Essentially, the invariant matrix $B$ is adapted such that each component of $u_t$ connects to a group of components in the accumulated states that co-occur frequently. In other words, whenever a component in $u_t$ is active it lowers the coefficient of a set of components in $x_t^{(n)}, \forall n$, making them more likely to be active. Since co-occurring components typically share some common statistical regularity, such activity of $u_t$ typically leads to a locally invariant representation.
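The multiplicative interaction can be illustrated with a small sketch. It assumes the sparsity coefficients take the form $\gamma_{t,k} = \gamma_0 (1 + \exp(-[Bu_t]_k))/2$ and that the cause energy is the modulated $\ell_1$ penalty on the pooled states plus $\beta\|u_t\|_1$; the function names and the toy matrix `B` are hypothetical.

```python
import numpy as np

def modulated_sparsity(u, B, gamma0):
    """Per-component coefficients gamma_k = gamma0 * (1 + exp(-[B u]_k)) / 2."""
    return gamma0 * (1.0 + np.exp(-(B @ u))) / 2.0

def cause_energy(u, X, B, gamma0, beta):
    """Modulated L1 penalty on the states of all N patches plus an L1 penalty on u.
    X has shape (N, K): one K-dimensional state vector per pooled patch."""
    pooled = np.sum(np.abs(X), axis=0)           # sum-pooled |states|, shape (K,)
    return modulated_sparsity(u, B, gamma0) @ pooled + beta * np.sum(np.abs(u))

# Toy invariance matrix: cause 0 groups states 0 and 1, cause 1 connects to state 2.
B = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
gamma_off = modulated_sparsity(np.array([0.0, 0.0]), B, gamma0=1.0)
gamma_on = modulated_sparsity(np.array([3.0, 0.0]), B, gamma0=1.0)
```

With cause 0 active, the coefficients of states 0 and 1 drop towards $\gamma_0/2$ while the coefficient of state 2 is unchanged, so the grouped states become cheaper to activate.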
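Though the two cost functions are presented separately above, we can combine both to devise a unified energy function of the form:
$$E(x_t, u_t, \theta) = \sum_{n=1}^{N} \left( \frac{1}{2}\|y_t^{(n)} - C x_t^{(n)}\|_2^2 + \lambda \|x_t^{(n)} - A x_{t-1}^{(n)}\|_1 + \sum_{k=1}^{K} \left| \gamma_{t,k} \cdot x_{t,k}^{(n)} \right| \right) + \beta \|u_t\|_1 \tag{5}$$
where $\theta = \{A, B, C\}$. As we will discuss next, both $x_t$ and $u_t$ can be inferred concurrently from (5) by alternately updating one while keeping the other fixed using an efficient proximal gradient method.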
To learn the parameters in (5), we alternately minimize $E(x_t, u_t, \theta)$ using a procedure similar to block coordinate descent. We first infer the latent variables while keeping the parameters fixed and then update the parameters while keeping the variables fixed. This is done until the parameters converge. We now discuss separately the inference procedure and how we update the parameters using a gradient descent method with the fixed variables.
We jointly infer both $x_t$ and $u_t$ from (5) using proximal gradient methods, taking alternating gradient descent steps to update one while holding the other fixed. In other words, we alternate between updating $x_t$ and $u_t$ using a single update step to minimize $E_1$ and $E_2$, respectively. However, updating $x_t$ is relatively more involved. So, keeping aside the causes, we first focus on inferring sparse states alone from $E_1$, and then go back to discuss the joint inference of both the states and the causes.
Inferring States: Inferring sparse states, given the parameters, from a linear dynamical system forms the crux of our model. This is performed by finding the solution that minimizes the energy function $E_1$ in (3) with respect to the states $x_t$ (while keeping the sparsity parameter $\gamma$ fixed). Here there are two priors on the states: the temporal dependence and the sparsity term. Although this energy function $E_1$ is convex in $x_t$, the presence of two non-smooth terms makes it hard to use standard optimization techniques used for sparse coding alone. A similar problem is solved using dynamic programming, homotopy and Bayesian sparse coding; however, the optimization used in these models is computationally expensive for use in large scale problems like object recognition.
To overcome this, inspired by the method proposed in  for structured sparsity, we propose an approximate solution that is consistent and able to use efficient solvers like the fast iterative shrinkage thresholding algorithm (FISTA). The key to our approach is to first use Nesterov’s smoothness method [20, 18] to approximate the non-smooth state transition term. The resulting energy function is a convex and continuously differentiable function in $x_t$ with a sparsity constraint, and hence, can be efficiently solved using proximal methods like FISTA.
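To begin, let $\Omega(x_t) = \|e_t\|_1$ where $e_t = x_t - A x_{t-1}$. The idea is to find a smooth approximation to this function $\Omega(x_t)$ in $e_t$. Notice that, since $e_t$ is a linear function of $x_t$, the approximation will also be smooth w.r.t. $x_t$. Now, we can re-write $\Omega(x_t)$ using the dual norm of $\ell_1$ as
$$\Omega(x_t) = \max_{\|\alpha\|_\infty \le 1} \alpha^T e_t$$
where $\alpha \in \mathbb{R}^K$. Using the smoothing approximation from Nesterov on $\Omega(x_t)$:
$$\Omega(x_t) \approx f_\mu(e_t) = \max_{\|\alpha\|_\infty \le 1} \left[ \alpha^T e_t - \mu\, d(\alpha) \right] \tag{6}$$
where $d(\alpha) = \frac{1}{2}\|\alpha\|_2^2$ is a smoothing function and $\mu$ is a smoothness parameter. From Nesterov’s theorem , it can be shown that $f_\mu(e_t)$ is convex and continuously differentiable in $e_t$ and the gradient of $f_\mu(e_t)$ with respect to $e_t$ takes the form
$$\nabla_{e_t} f_\mu(e_t) = \alpha^* \tag{7}$$
where $\alpha^*$ is the optimal solution to the maximization in (6) (please refer to the supplementary material for the exact form of $\alpha^*$). This implies, by using the chain rule, that $f_\mu(e_t)$ is also convex and continuously differentiable in $x_t$, with the same gradient.
With this smoothing approximation, the overall cost function from (3) can now be re-written as
$$x_t = \arg\min_{x_t} \; \frac{1}{2}\|y_t - C x_t\|_2^2 + \lambda f_\mu(e_t) + \gamma \|x_t\|_1 \tag{8}$$
with the smooth part $h(x_t) = \frac{1}{2}\|y_t - C x_t\|_2^2 + \lambda f_\mu(e_t)$ whose gradient with respect to $x_t$ is given by
$$\nabla_{x_t} h(x_t) = -C^T (y_t - C x_t) + \lambda \alpha^* \tag{9}$$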
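The state inference can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the smooth part is $h(x_t) = \frac{1}{2}\|y_t - Cx_t\|_2^2 + \lambda f_\mu(x_t - Ax_{t-1})$ with gradient $-C^T(y_t - Cx_t) + \lambda\alpha^*$, where $\alpha^*$ is $(x_t - Ax_{t-1})/\mu$ clipped to $[-1, 1]$, and it uses a constant FISTA step $1/L$ with the bound $L = \|C\|_2^2 + \lambda/\mu$. The function names, the toy dictionary and all parameter values are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def smooth_part(x, y, x_prev, C, A, lam, mu):
    """Value and gradient of h(x) = 0.5*||y - C x||^2 + lam * f_mu(x - A x_prev)."""
    e = x - A @ x_prev
    alpha = np.clip(e / mu, -1.0, 1.0)            # maximizer of the smoothed dual form
    f_mu = alpha @ e - 0.5 * mu * (alpha @ alpha)
    r = y - C @ x
    return 0.5 * (r @ r) + lam * f_mu, -(C.T @ r) + lam * alpha

def objective(x, y, x_prev, C, A, lam, gamma, mu):
    """Smoothed energy: h(x) plus the remaining non-smooth sparsity term."""
    return smooth_part(x, y, x_prev, C, A, lam, mu)[0] + gamma * np.sum(np.abs(x))

def infer_states(y, x_prev, C, A, lam, gamma, mu=0.01, n_iter=200):
    """FISTA with constant step 1/L, where L bounds the Lipschitz constant of grad h."""
    L = np.linalg.norm(C, 2) ** 2 + lam / mu
    x = x_prev.copy()
    z = x.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = smooth_part(z, y, x_prev, C, A, lam, mu)[1]
        x_new = soft_threshold(z - grad / L, gamma / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum (extrapolation) step
        x, t = x_new, t_new
    return x

# Toy problem: 4-dimensional input, 5 dictionary atoms with unit-norm columns.
C = np.hstack([np.eye(4), np.full((4, 1), 0.5)])
A = 0.9 * np.eye(5)
x_prev = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y = C[:, 1].copy()                                # the input now matches atom 1
x_hat = infer_states(y, x_prev, C, A, lam=0.1, gamma=0.05)
```

Starting from the previous state, the iterations move most of the activity to atom 1, which explains the new input, while the transition term keeps a small residual on atom 0.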
Inferring Causes: Given a group of state vectors, $u_t$ can be inferred by minimizing $E_2$, where we define a generative model that modulates the sparsity of the pooled state vector, $\sum_n |x_t^{(n)}|$. Here we observe that FISTA can be readily applied to infer $u_t$, as the smooth part of the function $E_2$:
$$h(u_t) = \sum_{k=1}^{K} \left( \gamma_0 \left[ \frac{1 + \exp(-[B u_t]_k)}{2} \right] \cdot \sum_{n=1}^{N} |x_{t,k}^{(n)}| \right) \tag{10}$$
is convex, continuously differentiable and Lipschitz in $u_t$ (the matrix $B$ is initialized with non-negative entries and continues to be non-negative without any additional constraints). Following , it is easy to obtain a bound on the convergence rate of the solution.
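A single proximal-gradient update of the causes can be sketched as below. The sketch assumes the smooth part is $h(u_t) = \gamma_0 \sum_k \frac{1 + \exp(-[Bu_t]_k)}{2} s_k$, where $s_k = \sum_n |x_{t,k}^{(n)}|$ is the pooled state, so that $\nabla h(u_t) = -\frac{\gamma_0}{2} B^T (\exp(-Bu_t) \odot s)$. It shows a plain (non-accelerated) step for brevity; the function names, step size and toy values are hypothetical.

```python
import numpy as np

def h_causes(u, s, B, gamma0):
    """Smooth part of the cause energy; s is the sum-pooled absolute state vector."""
    return gamma0 * np.sum((1.0 + np.exp(-(B @ u))) / 2.0 * s)

def grad_h_causes(u, s, B, gamma0):
    """Analytic gradient: -(gamma0 / 2) * B^T (exp(-B u) * s)."""
    return -0.5 * gamma0 * (B.T @ (np.exp(-(B @ u)) * s))

def prox_step_causes(u, s, B, gamma0, beta, step):
    """Gradient step on h followed by soft-thresholding for beta * ||u||_1."""
    v = u - step * grad_h_causes(u, s, B, gamma0)
    return np.sign(v) * np.maximum(np.abs(v) - step * beta, 0.0)

B = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])          # toy non-negative invariance matrix
s = np.array([2.0, 1.5, 0.0])       # pooled states: components 0 and 1 are active
u0 = np.zeros(2)
u1 = prox_step_causes(u0, s, B, gamma0=1.0, beta=0.1, step=0.5)
```

After one step only the cause connected to the active states becomes non-zero (`u1` is `[0.825, 0.0]` here), since the gradient with respect to the other cause vanishes.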
Joint Inference: We showed thus far that both $x_t$ and $u_t$ can be inferred from their respective energy functions using a first-order proximal method called FISTA. However, for joint inference we have to minimize the combined energy function in (5) over both $x_t$ and $u_t$. We do this by alternately updating $x_t$ and $u_t$ while holding the other fixed and using a single FISTA update step at each iteration. It is important to point out that the internal FISTA step size parameters are maintained between iterations. This procedure is equivalent to alternating minimization using gradient descent. Although this procedure no longer guarantees convergence of both $x_t$ and $u_t$ to the optimal solution, in all of our simulations it led to a reasonably good solution. Please refer to Algorithm 1 (in the supplementary material) for details. Note that, with the alternating update procedure, each $x_t^{(n)}$ is now influenced by the feed-forward observations, temporal predictions and the feedback connections from the causes.
With $x_t$ and $u_t$ fixed, we update the parameters by minimizing $E$ in (5) with respect to $\theta$. Since the inputs here are a time-varying sequence, the parameters are updated using dual estimation filtering; i.e., we put an additional constraint on the parameters such that they follow a state space equation of the form:
$$\theta_t = \theta_{t-1} + z_t \tag{11}$$
where $z_t$ is Gaussian transition noise over the parameters. This keeps track of their temporal relationships. Along with this constraint, we update the parameters using gradient descent. Notice that with fixed $x_t$ and $u_t$, each of the parameter matrices can be updated independently. Matrices $C$ and $B$ are column normalized after the update to avoid any trivial solution.
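The gradient update followed by column normalization can be sketched for the dictionary $C$ alone. This is a simplified illustration: it assumes only the reconstruction term $\frac{1}{2}\sum_t \|y_t - Cx_t\|_2^2$ depends on $C$, and it leaves out the state-space constraint on the parameters. The function name `update_dictionary`, the learning rate and the random toy data are assumptions.

```python
import numpy as np

def update_dictionary(C, Y, X, lr):
    """One gradient step on 0.5 * sum_t ||y_t - C x_t||^2, then unit-norm columns.
    Y: (P, T) inputs; X: (K, T) inferred states, held fixed during the update."""
    grad = -(Y - C @ X) @ X.T
    C_new = C - lr * grad
    norms = np.linalg.norm(C_new, axis=0, keepdims=True)
    return C_new / np.maximum(norms, 1e-12)       # guard against all-zero columns

rng = np.random.default_rng(0)
P, K, T = 6, 10, 50
C0 = rng.standard_normal((P, K))
X = rng.standard_normal((K, T))                   # stand-in for inferred states
Y = rng.standard_normal((P, T))                   # stand-in for input patches
C1 = update_dictionary(C0, Y, X, lr=1e-3)
```

Normalizing the columns removes the scale ambiguity between the dictionary and the states, which would otherwise let the sparsity penalty be reduced by shrinking the states and growing the filters.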
Mini-Batch Update: To get faster convergence, the parameters are updated after performing inference over a large sequence of inputs instead of at every time instance. With this “batch” of signals, more sophisticated gradient methods, like conjugate gradient, can be used and, hence, can lead to more accurate and faster convergence.
So far the discussion has focused on encoding a small part of a video frame using a single stage network. To build a hierarchical model, we use this single stage network as a basic building block and arrange copies of it to form a tree structure (see Figure 1(b)). To learn this hierarchical model, we adopt a greedy layer-wise procedure like many other deep learning methods [11, 6, 8]. Specifically, we use the following strategy to learn the hierarchical model.
For the first (or bottom) layer, we learn a dynamic network as described above over a group of small patches from a video. We then take this learned network and replicate it at several places on a larger part of the input frames (similar to weight sharing in a convolutional network). The outputs (causes) from each of these replicated networks are considered as inputs to the layer above. Similarly, in the second layer the inputs are again grouped together (depending on the spatial proximity in the image space) and are used to train another dynamic network. A similar procedure can be followed to build higher layers.
We again emphasize that the model is learned in a layer-wise manner, i.e., there is no top-down information while learning the network parameters. Also note that, because of the pooling of the states at each layer, the receptive field of the causes becomes progressively larger with the depth of the model.
With the parameters fixed, we now shift our focus to inference in the hierarchical model with the top-down information. As we discussed above, the layers in the hierarchy are arranged in a Markov chain, i.e., the variables at any layer are only influenced by the variables in the layer below and the layer above. Specifically, the states $x_t^{(l)}$ and the causes $u_t^{(l)}$ at layer $l$ are inferred from $u_t^{(l-1)}$ and are influenced by $x_t^{(l+1)}$ (through the prediction term $C^{(l+1)} \hat{x}_t^{(l+1)}$; the suffixes indicating the group are considered implicit here to simplify the notation). Ideally, to perform inference in this hierarchical model, all the states and the causes have to be updated simultaneously depending on the present state of all the other layers until the model reaches equilibrium. However, such a procedure can be very slow in practice. Instead, we propose an approximate inference procedure that only requires a single top-down flow of information and then a single bottom-up inference using this top-down information.
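For this we consider that at any layer $l$ a group of inputs $u_t^{(l-1,n)}, \forall n \in \{1, 2, \dots, N\}$, are encoded using a group of states $x_t^{(l,n)}, \forall n$, and the causes $u_t^{(l)}$ by minimizing the following energy function:
$$E_l(x_t^{(l)}, u_t^{(l)}, \theta^{(l)}) = \sum_{n=1}^{N} \left( \frac{1}{2}\|u_t^{(l-1,n)} - C^{(l)} x_t^{(l,n)}\|_2^2 + \lambda \|x_t^{(l,n)} - A^{(l)} x_{t-1}^{(l,n)}\|_1 + \sum_{k=1}^{K} \left| \gamma_{t,k}^{(l)} \cdot x_{t,k}^{(l,n)} \right| \right) + \beta \|u_t^{(l)}\|_1 + \frac{1}{2}\|u_t^{(l)} - \hat{u}_t^{(l+1)}\|_2^2 \tag{12}$$
where $\theta^{(l)} = \{A^{(l)}, B^{(l)}, C^{(l)}\}$. Notice the additional term involving $\hat{u}_t^{(l+1)}$ when compared to (5). This comes from the top-down information, where we call $\hat{u}_t^{(l+1)}$ the top-down prediction of the causes of layer $l$ using the previous states in layer $l+1$. Specifically, before the “arrival” of a new observation at time $t$, at each layer $l$ (starting from the top layer) we first propagate the most likely causes to the layer $l-1$ using the state at the previous time instance $x_{t-1}^{(l)}$ and the predicted causes $\hat{u}_t^{(l+1)}$. More formally, the top-down prediction at layer $l$ is obtained as
$$\hat{u}_t^{(l)} = C^{(l)} \hat{x}_t^{(l)}, \qquad \hat{x}_t^{(l)} = \arg\min_{x_t^{(l)}} \; \lambda \|x_t^{(l)} - A^{(l)} x_{t-1}^{(l)}\|_1 + \gamma_0 \sum_{k=1}^{K} \left| \hat{\gamma}_{t,k} \cdot x_{t,k}^{(l)} \right|, \qquad \hat{\gamma}_{t,k} = \frac{\exp(-[B^{(l)} \hat{u}_t^{(l+1)}]_k)}{2} \tag{13}$$
At the topmost layer, $L$, a “bias” is set such that $\hat{u}_t^{(L+1)} = u_{t-1}^{(L)}$, i.e., the top layer induces some temporal coherence on the final outputs. From (13), it is easy to show that the predicted states for layer $l$ can be obtained as
$$\hat{x}_{t,k}^{(l)} = \begin{cases} [A^{(l)} x_{t-1}^{(l)}]_k, & \gamma_0 \hat{\gamma}_{t,k} < \lambda \\ 0, & \gamma_0 \hat{\gamma}_{t,k} \ge \lambda \end{cases} \tag{14}$$
These predicted causes $\hat{u}_t^{(l+1)}, \forall l \in \{1, 2, \dots, L\}$, are substituted in (12) and a single layer-wise bottom-up inference is performed as described in section 2.2.1 (note that the additional term in the energy function only leads to a minor modification in the inference procedure, namely this has to be added to $h(u_t)$ in (10)). The combined prior now imposed on the causes, $\beta \|u_t^{(l)}\|_1 + \frac{1}{2}\|u_t^{(l)} - \hat{u}_t^{(l+1)}\|_2^2$, is similar to the elastic net prior discussed in , leading to a smoother and biased estimate of the causes.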
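The top-down pass at one layer can be sketched as follows. The sketch assumes the predicted states follow a per-component rule: with $\hat{\gamma}_k = \exp(-[B\hat{u}]_k)/2$, component $k$ keeps the propagated value $[Ax_{t-1}]_k$ when $\gamma_0\hat{\gamma}_k < \lambda$ and is set to zero otherwise, which is the minimizer of $\lambda|x_k - [Ax_{t-1}]_k| + \gamma_0\hat{\gamma}_k|x_k|$. The function name, argument names and toy matrices are hypothetical.

```python
import numpy as np

def top_down_prediction(x_prev, u_hat_above, A, B, C, lam, gamma0):
    """Predict this layer's states and the causes of the layer below.
    u_hat_above: causes predicted for this layer by the layer above."""
    gamma_hat = np.exp(-(B @ u_hat_above)) / 2.0
    propagated = A @ x_prev
    # Keep a propagated component only where the modulated sparsity
    # penalty is weaker than the transition penalty.
    x_hat = np.where(gamma0 * gamma_hat < lam, propagated, 0.0)
    return x_hat, C @ x_hat

A = np.eye(3)
C = np.eye(3)
B = np.array([[2.0, 0.0],
              [2.0, 0.0],
              [0.0, 2.0]])
x_prev = np.array([1.0, -2.0, 3.0])
u_hat_above = np.array([1.0, 0.0])      # the layer above predicts cause 0 only
x_hat, u_hat_below = top_down_prediction(x_prev, u_hat_above, A, B, C, lam=0.3, gamma0=1.0)
```

Here the states grouped under the predicted cause are carried forward (`x_hat` is `[1, -2, 0]`), while the state whose cause is not predicted is suppressed.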
Firstly, we would like to test the ability of the proposed model to learn complex features in the higher layers of the model. For this we train a two-layer network from a natural video. Each frame in the video was first contrast normalized as described in . Then, we train the first layer of the model on overlapping contiguous pixel patches from this video; this layer has 400-dimensional states and 100-dimensional causes. The causes pool the states related to all the patches. The separation between the overlapping patches here was pixels, implying that the receptive field of the causes in the first layer is pixels. Similarly, the second layer is trained on causes from the first layer obtained from overlapping pixel patches from the video. The separation between the patches here is pixels, implying that the receptive field of the causes in the second layer is pixels. The second layer contains 200-dimensional states and 50-dimensional causes that pool the states related to all the patches.
Figure 2 shows the visualization of the receptive fields of the invariant units (columns of matrix $B$) at each layer. We observe that each dimension of causes in the first layer represents a group of primitive features (like inclined lines) which are localized in orientation or position (please refer to the supplementary material for more results). The causes in the second layer, in contrast, represent more complex features, like corners, angles, etc. These filters are consistent with previously proposed methods like Lee et al. and Zeiler et al.
In this section, we show the role of the top-down information during inference, particularly in the presence of structured noise. Video sequences consisting of objects of three different shapes (refer to Figure 3) were constructed. The objective is to classify each frame as coming from one of the three different classes. For this, several pixel, 100-frame-long sequences were made using two objects of the same shape bouncing off each other and the “walls”. Several such sequences were then concatenated to form a 30,000-frame-long sequence. We train a two-layer network using this sequence. First, we divided each frame into patches with neighboring patches overlapping by 4 pixels; each frame is divided into 16 patches. The bottom layer was trained such that the patches were used as inputs and were encoded using a 100-dimensional state vector. Contiguous neighboring patches were pooled to infer the causes that have 40 dimensions. The second layer was trained with first layer causes as inputs, which were themselves inferred from contiguous overlapping blocks of the video frames. The states here are 60-dimensional and the causes have only 3 dimensions. It is important to note here that the receptive field of the second layer causes encompasses the entire frame.
We test the performance of the DPCN in two conditions. The first case is with 300 frames of clean video, with 100 frames per shape, constructed as described above. We consider this as a single video without considering any discontinuities. In the second case, we corrupt the clean video with “structured” noise, where we randomly pick a number of objects from the same three shapes with a Poisson distribution (with mean 1.5) and add them to each frame independently at random locations. There is no correlation between any two consecutive frames regarding where the “noisy objects” are added (see Figure 3(b)).
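The corruption step can be sketched as below. The frame size, the binary shape masks and the function name `add_structured_noise` are illustrative assumptions; only the Poisson mean of 1.5 and the independent, random placement per frame come from the description above.

```python
import numpy as np

def add_structured_noise(frames, shapes, mean_objects=1.5, seed=0):
    """Paste a Poisson-distributed number of distractor shapes into each frame,
    independently per frame and at uniformly random positions."""
    rng = np.random.default_rng(seed)
    noisy = frames.copy()
    T, H, W = frames.shape
    for t in range(T):
        for _ in range(rng.poisson(mean_objects)):
            mask = shapes[rng.integers(len(shapes))]
            h, w = mask.shape
            r = rng.integers(0, H - h + 1)
            c = rng.integers(0, W - w + 1)
            patch = noisy[t, r:r + h, c:c + w]
            noisy[t, r:r + h, c:c + w] = np.maximum(patch, mask)
    return noisy

frames = np.zeros((20, 16, 16))                       # stand-in for the clean video
shapes = [np.ones((3, 3)), np.eye(3), np.triu(np.ones((3, 3)))]   # hypothetical masks
noisy = add_structured_noise(frames, shapes)
```

Because a fresh count and fresh positions are drawn for every frame, the distractors carry no temporal structure, unlike the true objects.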
First we consider the clean video and perform inference with only bottom-up inference, i.e., during inference we consider $\hat{u}_t^{(l+1)} = 0$ at every layer $l$. Figure 3(a) shows the scatter plot of the three-dimensional causes at the top layer. Clearly, there are 3 clusters recognizing three different shapes in the video sequence. Figure 3(b) shows the scatter plot when the same procedure is applied on the noisy video. We observe that the 3 shapes here cannot be clearly distinguished. Finally, we use top-down information along with the bottom-up inference as described in section 2.4 on the noisy data. We argue that, since the second layer learned class-specific information, the top-down information can help the bottom layer units to disambiguate the noisy objects from the true objects. Figure 3(c) shows the scatter plot for this case. Clearly, with the top-down information, in spite of the largely corrupted sequence, the DPCN is able to separate the frames belonging to the three shapes (the trace from one cluster to the other is because of the temporal coherence imposed on the causes at the top layer).
In this paper we proposed the deep predictive coding network, a generative model that empirically alters the priors in a dynamic and context-sensitive manner. This model is composed of two main components: (a) linear dynamical models with sparse states used for feature extraction, and (b) top-down information to adapt the empirical priors. The dynamic model captures the temporal dependencies and reduces the instability usually associated with sparse coding (please refer to the supplementary material for more details), while the task-specific information from the top layers helps to resolve ambiguities in the lower layers, improving data representation in the presence of noise. We believe that our approach can be extended with convolutional methods, paving the way for implementation of high-level tasks like object recognition, etc., on large scale videos or images.
This work is supported by the Office of Naval Research (ONR) grant #N000141010375. We thank Austin J. Brockmeier and Matthew Emigh for their comments and suggestions.
Proceedings of the International Conference on Computer Vision, Volume 2, ICCV ’99, pages 1150–, 1999. ISBN 0-7695-0164-8.
Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Volume 1, CVPR ’05, pages 886–893, 2005. ISBN 0-7695-2372-2.
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 609–616, 2009. ISBN 978-1-60558-516-1.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.
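The optimal solution of $\alpha$ in (6) is given by
$$\alpha^* = S\!\left( \frac{x_t - A x_{t-1}}{\mu} \right)$$
where $S(\cdot)$ is a function projecting onto an $\ell_\infty$-ball. This is of the form:
$$S(x) = \begin{cases} x, & -1 \le x \le 1 \\ 1, & x > 1 \\ -1, & x < -1 \end{cases}$$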
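A short numerical check of this projection, under the assumption that the smoothed term is $f_\mu(e) = \max_{\|\alpha\|_\infty \le 1} [\alpha^T e - \frac{\mu}{2}\|\alpha\|_2^2]$: clipping $e/\mu$ element-wise to $[-1, 1]$ gives the maximizer, and the resulting value coincides with the Huber function summed over components. The function names and test values are illustrative.

```python
import numpy as np

def project_linf_ball(v):
    """Element-wise projection onto the unit l-infinity ball (clipping to [-1, 1])."""
    return np.clip(v, -1.0, 1.0)

def smoothed_l1(e, mu):
    """Nesterov-smoothed l1 norm of e and its maximizer alpha* = S(e / mu)."""
    alpha = project_linf_ball(e / mu)
    return alpha @ e - 0.5 * mu * (alpha @ alpha), alpha

def huber(e, mu):
    """Closed form: e^2 / (2 mu) where |e| <= mu, and |e| - mu / 2 elsewhere."""
    a = np.abs(e)
    return np.sum(np.where(a <= mu, a * a / (2.0 * mu), a - mu / 2.0))

e = np.array([-2.0, 0.005, 0.5])
mu = 0.01
value, alpha_star = smoothed_l1(e, mu)
```

The smoothed value stays within $K\mu/2$ of $\|e\|_1$ for a $K$-dimensional $e$, so a small $\mu$ gives a tight approximation at the cost of a larger Lipschitz constant of the gradient.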