CortexNet: a Generic Network Family for Robust Visual Temporal Representations

06/08/2017 ∙ by Alfredo Canziani, et al. ∙ Purdue University

In the past five years we have observed the rise of incredibly well-performing feed-forward neural networks, trained with supervision, for vision-related tasks. These models have achieved super-human performance on object recognition, localisation, and detection in still images. However, there is a need to identify the best strategy to employ these networks with temporal visual inputs and obtain a robust and stable representation of video data. Inspired by the human visual system, we propose a deep neural network family, CortexNet, which features not only bottom-up feed-forward connections, but also models the abundant top-down feedback and lateral connections present in our visual cortex. We introduce two training schemes - the unsupervised MatchNet and weakly supervised TempoNet modes - in which a network learns to correctly anticipate the subsequent frame in a video clip or the identity of its predominant subject, by picking up egomotion cues and automatically tracking several objects in the current scene. Find the project website at https://engineering.purdue.edu/elab/CortexNet/.


1 Introduction

We have recently seen a wide and steady release of state-of-the-art feed-forward deep convolutional neural networks for vision-related tasks (Canziani et al., 2016). These models have reached, and then surpassed, human-level performance on object recognition (He et al., 2015) in the ImageNet classification challenge (Russakovsky et al., 2015). Currently, these models are trained end-to-end with strong supervision: large collections of annotated still images are fed to the networks, the gradient of a cross-entropy loss function with respect to the model parameters is computed with back-propagation, and gradient descent is used to minimise the error between the prediction and the ground truth. We then want to utilise these models for real-life applications, feeding them a stream of video frames and expecting them to behave similarly well on live data, but this is often not the case.

Furthermore, these models are highly susceptible to inputs corrupted by adversarial noise (Nguyen et al., 2015). Such inputs contain small, carefully designed perturbations that are invisible to normal human vision. To some extent, we can attribute the temporal prediction instability of feed-forward models to the natural occurrence of adversarial noise. Arguably, our visual system is immune to such temporal perturbations because, in the early years of an individual, it has been “trained to see” by tracking specific objects (Association, ) with sporadic parental weak supervision, and not from a large collection of static annotated flash cards.

Therefore, we propose CortexNet, a neural network family which not only models the bottom-up feed-forward connections in our visual system, but also employs delayed modulatory feedback and lateral connections, in order to learn end-to-end a more robust representation of natural temporal visual inputs. We train our models either unsupervisedly or with weak sparse annotations, by leveraging the temporal coherence present among the frames of a natural video clip. Our models show reliable short-term prediction of upcoming frames by (1) compensating for the camera egomotion, (2) learning the trajectory of the object present in the current scene, and (3) focussing on one object at a time. Our preliminary results indicate that the network develops an internal saliency and attention mechanism, which leads to an effective internal representation of the scene and a more robust class of networks.

2 Related work

Two main types of work are related to our research direction. The first uses the natural temporal order of frames in a video as a signal — or self-supervised pretext — to train neural networks and learn static visual representations, eliminating the need for large, expensively annotated data sets. The second aims to learn a temporal visual representation directly from the video data itself, by means of future and past frame reconstruction.

2.1 Learning static visual representations from videos

Exploiting the motion present in video data to learn visual representations is a prevalent approach in self-supervised and unsupervised learning techniques, since the frames' temporal coherence comes at no cost.

Wang and Gupta (2015) use a triplet loss to train a network so that it learns to differentiate patches belonging to a given tracked object from patches that do not. In this case, tracking is performed with non-trivial algorithms and is used to generate positive and negative training samples. In our work, we delegate any necessary operation to the network itself, computed directly on the source video data, and thus train the whole architecture end-to-end. Similarly, the model of Vondrick et al. (2016) predicts the future embedding of a video sequence, given only the current frame and without a system state. Nevertheless, predicting future representations is ill-posed per se, given that the only real ground truth available to the model is the unprocessed reality. Pathak et al. (2016) exploit motion-generated segmentation maps to train a neural network to segment objects from a single frame. Once more, even though videos are utilised, the network still operates in a feed-forward-only configuration and does not exploit temporal cues.

2.2 Learning dynamic visual representations

Remarkably, the most relevant previous work is the seminal NIPS ‘96 paper by Softky (1996), which uses a three-layer spiking feed-forward and feedback network — with kernels of 4 units, stride of 2, max pooling, and a multiplicative signal combiner — to predict the next frame in a natural video. Our model can be seen as a conversion and upgrade of Softky’s in a deep learning key, where the main building blocks are strided (de)convolutions, non-linearities, and additive signal mergers.

Srivastava et al. (2015) propose to learn to reconstruct the future and past sequences of frames, or their representations, by utilising an encoder-decoder recurrent network fed with 1D unrolled images or embeddings from a feed-forward convolutional net. We propose a model that is aware of temporal variations of its input pixels and is able to perceive motion in its early layers, operating directly on spatial inputs. The spatio-temporal video auto-encoder of Pătrăucean et al. (2015) is able to predict the next frame in a clip by combining spatio-recurrent, optical flow, smoothness penalty, grid generator, and sampler modules. Instead of drawing inspiration from standard video encoders and compression schemes, we are motivated by simpler, biologically plausible alternatives.

Finally — inspired by the neuroscientific predictive coding theory introduced by Rao and Ballard (1997) and expanded by Friston (2008) — Chalasani and Principe (2013) and Lotter et al. (2016) propose their respective stacks of hand-crafted modules. Instead, we choose not to engineer our modules, but to learn the necessary operations directly from the input data. We believe that a more generic and simpler architectural structure will provide the ground for an easier comparison and analysis of the learnt internal representation. In addition, we have experimented widely with Lotter et al.’s PredNet on multiple supervised tasks; however, we were unable to identify an ultimate strategy to utilise the learnt distributed representation. Therefore, we introduce a new network family which can be trained with several losses (we present just three of them here), allowing us to obtain a usable model with direct practical impact on working applications.

3 Model architecture family

We introduce a new family of networks which models not only feed-forward (bottom-up) but also lateral (horizontal) and feedback (top-down) connections between cortical areas of the visual system, which have been shown to provide perceptual context modulation (attention) (Lamme et al., 1998). As depicted in fig. 1, the model architecture is composed of paired discriminative (D) and generative (G) blocks. Each (D, G) pair is expected to model a specific cortical area of the human visual system. Each G block, fed with the superposition of top-down and bottom-up (also called residual) projections, provides a modulatory input to its corresponding D block, based on the previous time step (i.e. the feedback connections in fig. 1).

Blow-ups of the D and G blocks are shown in figs. 1b and 1c respectively. We can observe that the branching and superposition operations happen right after the spatial projection onto the (de)convolutional kernels, as this has proven more promising (He et al., 2016). The (de)convolutional kernels all share the same kernel size, stride, and four-sided padding; the number of feature maps varies per level, with any additional layers sharing a common feature count. For D blocks, the number of input maps is effectively doubled by the concatenation module. Finally, at the first time step the feedback connections are initialised to appropriately sized zero tensors.
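For concreteness, the following is a minimal PyTorch sketch of such a paired block structure, with feedback concatenation in the D blocks, lateral superposition in the G blocks, and a zero-initialised feedback state. The kernel size, channel counts, normalisation, and the exact wiring of the lateral and feedback paths are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class DBlock(nn.Module):
    """Bottom-up block: optionally concatenate the previous-step feedback with the
    input, then project it with a strided convolution (halving the spatial size)."""
    def __init__(self, c_in, c_out, has_feedback=True):
        super().__init__()
        self.has_feedback = has_feedback
        c_eff = 2 * c_in if has_feedback else c_in   # concatenation doubles the input maps
        self.conv = nn.Conv2d(c_eff, c_out, kernel_size=3, stride=2, padding=1)
        self.post = nn.Sequential(nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

    def forward(self, x, feedback=None):
        if self.has_feedback:
            if feedback is None:                     # first time step: zero feedback
                feedback = torch.zeros_like(x)
            x = torch.cat((x, feedback), dim=1)
        z = self.conv(x)                             # branching point: z also goes laterally
        return z, self.post(z)


class GBlock(nn.Module):
    """Top-down block: strided deconvolution, then superpose the lateral residual
    coming from the discriminative branch (additive signal merger)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(c_in, c_out, kernel_size=3, stride=2,
                                         padding=1, output_padding=1)
        self.post = nn.Sequential(nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

    def forward(self, top_down, lateral=None):
        z = self.deconv(top_down)
        if lateral is not None:
            z = z + lateral                          # superposition right after the deconv
        return self.post(z)


class CortexNetSketch(nn.Module):
    def __init__(self, channels=(3, 32, 64, 128), n_classes=35):
        super().__init__()
        pairs = list(zip(channels[:-1], channels[1:]))
        self.D = nn.ModuleList(DBlock(ci, co, has_feedback=i > 0)
                               for i, (ci, co) in enumerate(pairs))
        self.G = nn.ModuleList(GBlock(co, ci) for ci, co in pairs)
        self.classifier = nn.Linear(channels[-1], n_classes)   # embedding -> logits

    def forward(self, frame, state=None):
        n = len(self.D)
        state = state if state is not None else [None] * n
        residuals, x = [], frame
        for i, d in enumerate(self.D):               # bottom-up pass
            z, x = d(x, state[i])
            residuals.append(z)
        embedding = x.mean(dim=(2, 3))               # spatial average pooling
        logits = self.classifier(embedding)
        new_state, top = [None] * n, x
        for i in reversed(range(n)):                 # top-down pass
            lateral = residuals[i - 1] if i > 0 else None
            top = self.G[i](top, lateral)
            if i > 0:
                new_state[i] = top                   # feedback for the paired D at t + 1
        return top, embedding, logits, new_state     # `top` is the predicted next frame


# One time step on a dummy 3x128x128 frame.
model = CortexNetSketch()
prediction, embedding, logits, state = model(torch.rand(1, 3, 128, 128))
```

In this sketch the lateral residual taken right after each D block's convolution is added to the deconvolution output that lands at the same resolution, so all shapes match by construction; the input-side D block carries no concatenation and the output-side G block produces the frame prediction, mirroring the sensory-input and motor-output roles described above.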

Figure 1: (a) Model architecture, (b) discriminative and (c) generative block blow-ups, and (d) model’s embedding and logits. (a) The model architecture features two types of modules, called discriminative (D) and generative (G), which are linked together through feed-forward, lateral, and feedback connections. Vertical connections are drawn in blue, temporal feedback connections in red, and residual lateral connections in green. More details about the D and G blocks are shown in the two blow-ups (b, c), where we can notice that the branching and superposition operations take place right after the (de)convolutional modules. The input-side D block and the output-side G block do not have the concatenation and the superposition modules respectively, and they can be thought of as the sensory input and motor output interfaces. (d) Definition of the model’s embedding and logits.

4 Training frameworks

We apply two schemes to explore two different learning paradigms: MatchNet (unsupervised training configuration) and TempoNet (weakly supervised training configuration). Both schemes feed the network, through the sensory input interface, with batches of video sequences (with the minor side scaled down and square centre cropped), in multiple temporal chunks of a fixed number of frames. We treat all our videos as if they were linearly concatenated into one long clip, which is then reshaped into a rectangle whose height equals the batch size; the potentially remaining empty positions are filled with frames from the first video.
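A minimal sketch of this data-feeding scheme could look as follows; the batch size, chunk length, tensor shapes, and helper names are illustrative, not the actual data loader.

```python
import torch

def batchify(frames: torch.Tensor, batch_size: int) -> torch.Tensor:
    """frames: (N, C, H, W) tensor of all videos concatenated along time.
    Returns a (batch_size, N // batch_size, C, H, W) grid; if N is not divisible,
    the tail is padded with the first frames (mimicking the wrap-around fill)."""
    n = frames.size(0)
    rows = -(-n // batch_size)                      # ceil division: frames per row
    pad = rows * batch_size - n
    if pad:
        frames = torch.cat((frames, frames[:pad]), dim=0)
    return frames.view(batch_size, rows, *frames.shape[1:])

def temporal_chunks(grid: torch.Tensor, chunk_len: int):
    """Yield (batch_size, chunk_len, C, H, W) slices for truncated BPTT."""
    for t in range(0, grid.size(1), chunk_len):
        yield grid[:, t:t + chunk_len]

# Example with dummy data: 998 frames of 3x64x64, a batch of 4 rows, 20-frame chunks.
grid = batchify(torch.rand(998, 3, 64, 64), batch_size=4)
for chunk in temporal_chunks(grid, chunk_len=20):
    pass  # feed `chunk` to the network, one time step at a time
```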

It was our intention to initially pre-train our model unsupervisedly in the MatchNet configuration — so that it learns the dynamics of the videos present in our data set — and then to use it as a TempoNet with minimal effort and supervision. As we will see in sections 5.1 and 5.2, these two schemes seem to be mutually exclusive when we operate in pixel space. Similar findings are reported by Neverova et al. (2017), in contrast with those of Lotter et al. (2016).

Here we define four loss functions, which allow us to explore the model's affinity for learning video features. The future-matching μ and replica-checking ρ loss functions are both defined as:

μ(a, b) = ρ(a, b) = MSE(a, b) = (1 / n(a)) · Σ (a − b)²    (1)

where a and b are two same-sized tensors (or tensorial batches), MSE stands for mean squared error, n(·) is the element-count operator, and the summation is performed across every dimension. The temporal-stabilisation τ and periodic-classification π loss functions are defined as:

τ(y, c) = π(y, c) = CE(y, c) = −w_c · log(softmax(y)_c)    (2)

where y represents our logits (the spatial average pooling of the topmost D block's output, named the embedding, followed by a final linear transformation in order to obtain one output dimension per class), shown in fig. 1d; c is the correct class label index associated with the current video frame; w are the class-balancing weights; CE stands for cross entropy; and softmax(·) normalises the logits into a probability distribution. We average along the batch dimension, if batches are used. The system's training loss is defined as a linear combination of the matching μ, temporal τ, and periodic π losses and, more precisely, as:

L = λ_μ · μ + λ_τ · τ + λ_π · π    (3)

while we use the replica loss ρ to monitor the training health. We use the Greek letters as mnemonics for the respective losses, i.e. μ-matching, ρ-replica, τ-temporal, and π-periodic loss.
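The following sketch expresses the four losses and their linear combination in PyTorch, using the notation reconstructed above; the coefficient names (lam_mu, lam_tau, lam_pi) and their default values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def matching_loss(prediction, next_frame):                # mu: match the next frame
    return F.mse_loss(prediction, next_frame)

def replica_loss(prediction, current_frame):              # rho: monitor-only copy check
    return F.mse_loss(prediction, current_frame)

def classification_loss(logits, target, weights=None):    # tau / pi: weighted cross entropy
    return F.cross_entropy(logits, target, weight=weights)

def training_loss(mu, tau, pi, lam_mu=1.0, lam_tau=1.0, lam_pi=0.0):
    """Eq. (3): a linear combination of the matching, temporal, and periodic terms."""
    return lam_mu * mu + lam_tau * tau + lam_pi * pi
```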

Figure 2: (a) MatchNet and (b) TempoNet training configurations. Each black square box represents an instance of the model, which is replicated over time with parameter sharing. The green contour trapezoids represent the computation of the matching loss μ. The purple trapezoids compute the replica-matching loss ρ, which is used for monitoring the training health. The pairs of filled green trapezoids are disabled because (1) we have reached the last frame of a video — and therefore the model cannot match any new frame for the same video — and (2) the second-frame prediction for the following video would be erroneous, since the state has just been reset. The double-headed arrows indicate where back-propagation-through-time is performed, while the right-pointing pink arrows show when the state is propagated forward with no gradient propagated in the opposite direction (this happens when we reach the end of a video, or when we start a new BPTT temporal chunk). The blue and yellow triangles represent the computation of the classification losses τ and π respectively, fed with a linear transformation of the model embedding (the spatial average pooling of the topmost D block's output, see fig. 1d).

4.1 MatchNet mode

In MatchNet mode, we train the model unsupervisedly in order to minimise μ, i.e. to match the appearance of the next frame within the same video clip. Predictions of the second frame of a video (right after a state reset) and of the frame following the last one of a video (equivalently, the first frame of the next video) are disabled (see fig. 2), since they would be erroneous due to the state reset or to missing data. Our expectation is that the ability to predict the future scene will build an unsupervised internal representation of the video dynamics, which we can later exploit for other tasks. We then use ρ — representing how similar the generative branch's output is to the current input frame — in order to monitor the health of the training procedure, and to check whether we are effectively matching the next frame or merely replicating the current one.

We also employ the temporal loss τ every time a video reaches its last frame (see blue triangle in fig. 2); its inputs are the last model embedding for that video (see note 1: here we use an implicit indexing conversion between the number of frames of a video and the location where that video reaches its last frame in the batched input data) and, as target, the video's index (e.g. if our data set has K videos, the target ranges over 1, …, K), while the class-balancing weights w are set to the one-vector, so that each video contributes equally regardless of its number of frames. We use π only to monitor the prediction quality (λ_π = 0 in eq. 3), periodically, at every temporal chunk for all the videos in the current batch (see yellow triangles in fig. 2). While we use μ to train the model via back-propagation-through-time (BPTT) and learn the video dynamics, we utilise τ in a static manner, which means that the gradient is not sent backwards to earlier time steps (notice the right-pointing pink arrow, instead of a double-headed one, in correspondence of τ in fig. 2).

In order to validate our model's performance, we split every video into a training and a validation part. Given the clips' frame rate, this means that we use the last two seconds of each video as validation data.
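A hedged sketch of one MatchNet-mode temporal chunk, reusing the CortexNetSketch module from the earlier sketch, might look as follows. The optimiser settings are placeholders, the detached embedding is only a rough stand-in for the "static" (non-BPTT) use of τ, and the handling of the disabled predictions at video boundaries is omitted.

```python
import torch
import torch.nn.functional as F

def matchnet_chunk(model, optimiser, chunk, video_idx, state):
    """One truncated-BPTT step. chunk: (B, T, C, H, W) consecutive frames;
    video_idx: (B,) video-index targets; state: list of feedback tensors (or Nones)."""
    state = [s.detach() if s is not None else None for s in state]  # truncate BPTT here
    mu = 0.0
    for t in range(chunk.size(1) - 1):
        prediction, embedding, logits, state = model(chunk[:, t], state)
        mu = mu + F.mse_loss(prediction, chunk[:, t + 1])           # match the next frame
    # Rough stand-in for the static use of tau: detach the embedding so that no
    # gradient flows back through time, and classify the video index from it.
    tau = F.cross_entropy(model.classifier(embedding.detach()), video_idx)
    loss = mu / (chunk.size(1) - 1) + tau
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item(), state

# Example usage with placeholder hyperparameters and dummy data.
model = CortexNetSketch(n_classes=970)                # one logit per training video
optimiser = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
state = [None] * len(model.D)
chunk = torch.rand(2, 5, 3, 128, 128)                 # batch of 2, 5-frame chunk
video_idx = torch.tensor([3, 7])
loss_value, state = matchnet_chunk(model, optimiser, chunk, video_idx, state)
```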

4.2 TempoNet mode

In TempoNet mode we train our model with weak supervision over object-class recognition through the periodic loss π, where the target now represents the object class of each video in the current batch. We compute each component of the class-balancing weights w as:

w_c ∝ 1 / N_c    (4)

where N_c represents the number of samples available for each of the 35 classes, so that under-represented classes are weighted more heavily. Moreover, we use τ only to monitor the training status (λ_τ = 0 in eq. 3).

To avoid the situation where all the targets across the batch are held at a constant value, we adopted the following data-feeding strategy. Each video is split into several temporally subsampled videos; most of the splits form the training set, and the validation set gets the remaining ones. In this way the sampling rate is reduced and the average training video length shrinks with respect to MatchNet mode, which means that we observe many more video changes per temporal chunk than in the previous training scheme; the previous scheme would have caused the network to converge to an interesting unstable equilibrium point, given that our model is a dynamic non-linear system.
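Below is a small sketch of one plausible way to compute the class-balancing weights (inverse class frequency, as in eq. 4, with an arbitrary normalisation) and to produce the temporally subsampled splits; the number of splits and the sample counts are illustrative.

```python
import torch

def class_weights(samples_per_class: torch.Tensor) -> torch.Tensor:
    """Inverse-frequency weights, rescaled here so that they average to one."""
    w = 1.0 / samples_per_class.float()
    return w * (len(w) / w.sum())

def interleaved_splits(video: torch.Tensor, n_splits: int):
    """Split a (T, C, H, W) clip into n_splits temporally subsampled clips:
    split k keeps frames k, k + n_splits, k + 2 * n_splits, ..."""
    return [video[k::n_splits] for k in range(n_splits)]

# Example: 5 classes with unbalanced counts, and a 300-frame clip split 10 ways
# (e.g. 9 splits for training, 1 held out for validation).
w = class_weights(torch.tensor([120, 80, 40, 40, 20]))
splits = interleaved_splits(torch.rand(300, 3, 64, 64), n_splits=10)
```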

5 Experiments and results

In this section, we present the main results and the corresponding experimental settings that showcase the performance of the CortexNet architecture family. For our experiments we used the e-Lab Video Data Set (e-VDS35) (Culurciello and Canziani, 2017), a growing collection of currently 970 clips of roughly 10 seconds each, capturing one of 35 common everyday objects from different perspectives. Duration outliers have been removed by extracting the two-sided confidence interval of a Student's t-distribution fitted to the population of video lengths, removing clips shorter than the lower bound and trimming the ones longer than the upper bound. The project source code can be found at (Canziani and Culurciello, 2017).
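A sketch of this outlier-removal step, assuming SciPy's Student's t-distribution fit and a 95% two-sided interval (the actual confidence level and resulting frame bounds are not reproduced here), could be:

```python
import numpy as np
from scipy import stats

def length_bounds(lengths_in_frames, confidence=0.95):
    """Return (lower, upper) frame-count bounds of the two-sided confidence interval."""
    df, loc, scale = stats.t.fit(np.asarray(lengths_in_frames, dtype=float))
    return stats.t.interval(confidence, df, loc=loc, scale=scale)

def filter_and_trim(clip_lengths, confidence=0.95):
    """Drop clips shorter than the lower bound; trim clips longer than the upper bound."""
    lo, hi = length_bounds(clip_lengths, confidence)
    return [min(n, int(hi)) for n in clip_lengths if n >= lo]

# Dummy clip lengths, roughly 10 s at 30 fps.
lengths = np.random.normal(300, 30, size=970).astype(int)
print(len(filter_and_trim(lengths.tolist())), "clips kept after filtering/trimming")
```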

5.1 Unsupervised learning

Figure 3: (a) distances between the network's output, the current frame, and the next frame, and (b) MSE losses vs. time. Initially, the network's output is much closer to the current input. As soon as two frames have been fed, the network locks onto the temporal visual cues and predicts the next frame reliably, until the camera auto-focus kicks in and the input gets completely blurred. After that, the network keeps predicting its egomotion, even at the highest panning speed. In the upper part, we can see how the network's output and the current and next frames compare to each other at a given time index. Animation available at (Canziani and Culurciello, 2017).

In MatchNet mode, we train our model solely on unlabelled data and exploit the inherent data statistics as learning signals. More precisely, as explained in section 4.1, we feed the network batches of consecutive frames and utilise μ to predict the next frame and τ to identify which video we have just finished processing, with no reference to the object class it belongs to. In this way, we leverage only the intrinsic characteristics of our data, and supervision is reduced to its minimum. We trained a four-layer CortexNet with 970-dimensional output logits on top of the embedding. We used momentum, weight decay, and an initial learning rate decayed by a constant factor every 10 epochs, for a total of 30 epochs, and we set λ_μ and λ_τ in eq. 3 to fixed values. The matching loss μ we obtained is less than half of the replica loss ρ, which means we are more than twice as good at predicting the next frame as at merely copying the input one. In fig. 3 we can see how the model behaves on an interesting video from the validation set. The purple line in fig. 3 represents the panning speed: the higher its value, the larger the amount of motion recorded. The video clip starts with an initial non-zero panning, which slows down over 12 frames; the camera auto-focusses around frame 20, panning increases to its maximum around frame 44, and then it slows down again. The model constantly tracks the different moving elements, successfully predicting the future frame whenever the input is not corrupted (blurred by the auto-focus).

Surprisingly, as soon as the network learns to accurately predict the next frame in our training clips, the output logits become constant and the classification loss τ stops decreasing. This indicates that the task of generating the future input appearance is effectively carried out by a few of the lower (D, G) pairs alone. Therefore, we investigate whether we can obtain a more useful high-level representation with an alternative training strategy.

5.2 Supervised learning

In TempoNet mode, we train our model exclusively from the top representation through π, over the 35 object classes, with BPTT (note that MatchNet used video indices and static back-prop for classification). Although we experimented with several values for the matching coefficient λ_μ in eq. 3, this showed no overall effect on the classification task: we obtained comparable accuracy with and without the matching term, probably due to the amount of motion introduced by the subsampling.

Figure 4: Probability vs. time index for (a) the discriminator only and (b) the full CortexNet architecture, and (c) salient-region highlight. In these charts, spanning 5 seconds, we can see how much the output of a feed-forward architecture (the CortexNet discriminative branch only) flickers compared to that of the corresponding full CortexNet model. Even with a temporally varying input, our TempoNet is able to track the moving object in the scene, focussing its dynamic attention on it, and steadily predicts the correct object identity. In order to better understand how the model is able to perform this task so reliably, we utilised a salient-region highlighting algorithm (Bojarski et al., 2017) to visualise, over time, where the network is looking; a snapshot is included in (c). Animation available at (Canziani and Culurciello, 2017).

We pre-trained the discriminative branch of a six-layer CortexNet, with 33-dimensional output logits on top of the embedding, on an image data set of 300k images of objects similar to the ones in e-VDS35 (Culurciello and Canziani, 2017). This data set is a subset of Open Images (Krasin et al., 2016). For the pre-training, we used momentum, weight decay, and an initial learning rate reduced at regular intervals while training the feed-forward branch. The performance of the discriminator branch on a video clip is shown in fig. 4a.

Then we added the generative branch, swapped the classifier for a 35-dimensional one, and fine-tuned the whole model on e-VDS35 with fixed coefficients in eq. 3. We again used momentum, weight decay, and an initial learning rate reduced at regular intervals while training the full model. The model obtained in this way is much more robust to temporal perturbations of the input video stream (see fig. 4b), since it adopts a selective attention mechanism to focus on and track, over time, the main object present in the scene. Additionally, we use salient-object-finding algorithms (Bojarski et al., 2017; Canziani and Culurciello, 2015) to dynamically visualise the locations to which the network is currently paying attention (see the snapshot in fig. 4c).
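As a rough illustration of this kind of visualisation (and not the algorithm of Bojarski et al.), one can average the absolute activations of an intermediate feature map and upsample them to the input resolution:

```python
import torch
import torch.nn.functional as F

def activation_saliency(feature_map: torch.Tensor, out_size) -> torch.Tensor:
    """feature_map: (B, C, h, w) activations; returns (B, 1, H, W) maps in [0, 1]."""
    s = feature_map.abs().mean(dim=1, keepdim=True)             # collapse the channels
    s = F.interpolate(s, size=out_size, mode='bilinear', align_corners=False)
    s_min = s.amin(dim=(2, 3), keepdim=True)
    s_max = s.amax(dim=(2, 3), keepdim=True)
    return (s - s_min) / (s_max - s_min + 1e-8)                 # per-image normalisation

# Example on dummy activations from an intermediate layer.
sal = activation_saliency(torch.rand(1, 128, 16, 16), out_size=(128, 128))
```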

6 Conclusions

In this paper, we introduce a new family of neural networks, called CortexNet, which not only models the bottom-up feed-forward connections in the human visual system but also employs delayed modulatory feedback and lateral connections, in order to learn end-to-end a more robust representation of natural temporal visual inputs. We explore an unsupervised and a weakly supervised training strategy to train two models on a custom, object-centric video data set. We report performance in terms of prediction mean squared error, compare it against the trivial input-matching task, and also show how the new architecture provides a much more stable prediction output on a test video clip. Lastly, we observe that the task of predicting a future frame directly in pixel space is not complementary to that of predicting low-frequency labels, such as video indices, action categories, or anything else that spans several tens of frames in time.

Acknowledgements

This project leveraged the power, speed, and quick implementation time of PyTorch for all computationally expensive operations. It resorted to the illustrating capabilities of the Inkscape vector graphics software, and it explored and visualised data through the matplotlib library combined with the Jupyter Notebook interactive computational environment. This work was partly sponsored by the Office of Naval Research grants N00014-15-1-2791 and N00014-17-1-2225. We also thank NVIDIA for the donation of graphics processors.

References