We have recently seen a wide and steady release of state-of-the-art feed-forward deep convolutional neural networks for vision-related tasks (Canziani et al., 2016). These models have reached, and then surpassed, human-level performance on object recognition (He et al., 2015) in the ImageNet classification challenge (Russakovsky et al., 2015). Currently, these models are trained end-to-end with strong supervision: large collections of annotated still images are fed to the networks, the gradient of a cross-entropy loss function with respect to the model parameters is computed with back-propagation, and gradient descent is used to minimise the error between the prediction and the ground truth. We then want to use these models in real-life applications, feeding them a stream of video frames and expecting them to behave similarly well on live data, but this is often not the case.
Furthermore, these models are highly susceptible to inputs corrupted by adversarial noise (Nguyen et al., 2015). Such inputs are made up of small, carefully designed perturbations that are invisible to normal human vision. To some extent, we can attribute the temporal prediction instability of feed-forward models to the natural occurrence of adversarial noise. Arguably, our visual system is immune to such temporal perturbations because, in the early years of an individual, it has been “trained to see” by tracking specific objects (American Optometric Association) with sporadic, weak parental supervision, and not from a large collection of static annotated flash cards.
Therefore, we propose CortexNet, a neural network family which not only models the bottom-up feed-forward connections in our visual system, but also employs delayed modulatory feedback and lateral connections, in order to learn end-to-end a more robust representation of natural temporal visual inputs. We train our models either unsupervisedly or with weak, sparse annotations, leveraging the temporal coherence present among the frames of a natural video clip. Our models show reliable short-term next-frame prediction by (1) compensating for the camera's egomotion, (2) learning the trajectory of the object present in the current scene, and (3) focussing on one object at a time. Our preliminary results indicate that the network develops an internal saliency and attention mechanism, which leads to an effective internal representation of our reality and demonstrates a superior and more robust network class.
2 Related work
Two main types of work are related to our research direction. The first uses the natural temporal order of frames in a video as a signal — or self-supervised pretext — to train neural networks and learn static visual representations, eliminating the need for expensive, large annotated data sets. The second aims to learn a temporal visual representation directly from the video data itself, by means of future and past frame reconstruction.
2.1 Learning static visual representations from videos
Exploiting the motion present in video data to learn visual representations is a prevalent approach among self-supervised and unsupervised learning techniques, since the frames' temporal coherency comes to us at no cost. Wang and Gupta (2015) use a triplet loss to train a network to differentiate patches belonging to a given tracked object from patches that do not. In their case, tracking is performed with non-trivial algorithms and is used to generate positive and negative training samples. In our work, we delegate any necessary computation to the network itself, operating directly on the source video data and training the whole architecture end-to-end. Similarly, the model of Vondrick et al. (2016) predicts the future embedding of a video sequence given only the current frame, without a system state. Nevertheless, predicting future representations is ill-posed per se, given that the only real ground truth is the unprocessed reality available to the model. Pathak et al. (2016) exploit motion-generated segmentation maps to train a neural network to segment objects from a single frame. Once more, even though videos are utilised, the network still operates in a feed-forward-only configuration and does not exploit temporal cues.
2.2 Learning dynamic visual representations
Remarkably, the most relevant previous work is the seminal NIPS '96 paper by Softky (1996), which uses a three-layer spiking feed-forward and feedback network — with kernels of 4 units, a stride of 2, max pooling, and a multiplicative signal combiner — to predict the next frame in a natural video. Our model can be seen as a conversion and upgrade of Softky's in a deep learning key, where the main building blocks are strided (de)convolutions, non-linearities, and additive signal mergers. Srivastava et al. (2015) propose to learn to reconstruct future and past sequences of frames, or their representations, using an encoder-decoder recurrent network fed with 1D unrolled images or embeddings from a feed-forward convolutional net. We instead propose a model that is aware of temporal variations of its input pixels and is able to perceive motion in its early layers, operating directly on spatial inputs. The spatio-temporal video auto-encoder of Pătrăucean et al. (2015) is able to predict the next frame in a clip using a combination of spatio-recurrent, optical flow, smoothness penalty, grid generator, and sampler modules. Instead of drawing inspiration from standard video encoders and compression schemes, we are motivated by simpler, biologically plausible alternatives.
Finally — inspired by the neuroscientific predictive coding theory introduced by Rao and Ballard (1997) and expanded by Friston (2008) — Chalasani and Principe (2013) and Lotter et al. (2016) propose their respective stacks of hand-crafted modules. We choose instead not to engineer our modules, but to learn the necessary operations directly from the input data. We believe that a more generic and simpler architecture will provide the ground for an easier comparison and analysis of the learnt internal representation. In addition, we have experimented extensively with Lotter et al.'s PredNet on multiple supervised tasks; however, we were unable to identify an effective strategy to utilise the learnt distributed representation. Therefore, we introduce a new network family which can be trained with several losses (although we present here just three of them), allowing us to obtain a usable model with direct practical impact on working applications.
3 Model architecture family
We introduce a new family of networks which models not only feed-forward (bottom-up) but also lateral (horizontal) and feedback (top-down) connections between cortical areas of the visual system, which have been shown to provide perceptual context modulation (attention) (Lamme et al., 1998). As depicted in fig. 1, the model architecture is composed of paired discriminative and generative blocks. Each pair is expected to model a specific cortical area of the human visual system. Each generative block, fed with the superposition of top-down and bottom-up (also called residual) projections, provides a modulatory input to its corresponding discriminative block, based on the previous time step (i.e. the delayed feedback connections in fig. 1).
Blow-ups of the discriminative and generative blocks are shown in fig. 1. We can observe that the branching and superposition operations happen right after the spatial projection onto the (de)convolutional kernels, as this arrangement has been shown to be more effective (He et al., 2016). The (de)convolutional kernels all share the same size, stride, and four-sided padding; the number of feature maps varies per layer, and all additional layers share the same feature count. For the generative blocks, the number of input maps is doubled due to the concatenation module. Finally, the feedback connections are initialised, at the first time step, to appropriately sized zero tensors.
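As an illustrative sketch of the shape bookkeeping in one discriminative/generative pair, using toy 1D signals, stride-2 operations, and plain averaging in place of learnt kernels (all function names here are ours, not from the paper):

```python
def conv1d_strided(x, k=2, stride=2):
    """Toy strided 'convolution': average k neighbouring values, step by
    stride.  Halves the sequence length when stride == 2 and len(x) is even."""
    return [sum(x[i:i + k]) / k for i in range(0, len(x) - k + 1, stride)]

def deconv1d_strided(y, stride=2):
    """Toy strided 'deconvolution': repeat each value `stride` times,
    restoring the original sequence length."""
    return [v for v in y for _ in range(stride)]

def dg_pair(x, top_down):
    """One discriminative/generative pair: the discriminative block projects
    bottom-up; the generative block merges the top-down signal additively
    with the bottom-up (residual) branch and projects it back up to the
    input resolution, yielding the modulatory feedback."""
    bottom_up = conv1d_strided(x)                          # D block: downsample
    merged = [b + t for b, t in zip(bottom_up, top_down)]  # additive merge
    feedback = deconv1d_strided(merged)                    # G block: upsample
    return bottom_up, feedback

x = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0, 5.0, 7.0]  # 8-sample toy input
top_down = [0.0] * 4                          # zero-initialised feedback
bu, fb = dg_pair(x, top_down)
assert len(bu) == len(x) // 2                 # D halves the resolution
assert len(fb) == len(x)                      # G restores it
```

In a real instantiation these would be learnt strided (de)convolutions followed by non-linearities; only the additive merging and the resolution bookkeeping are meant to carry over.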
4 Training frameworks
We apply two schemes to explore two different learning paradigms: MatchNet (unsupervised training configuration) and TempoNet (weakly supervised training configuration). Both schemes feed the network, through the sensory input interface, with batches of video sequences (minor side scaled and square centre cropped), in multiple temporal chunks of fixed length. We treat all our videos as if they were linearly concatenated into one long clip and then reshaped into a rectangle whose height equals the batch size; any remaining empty positions are filled with frames from the beginning of the first video.
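The feeding scheme above can be sketched in a few lines of Python; the batch size, chunk length, and frame labels below are toy values of our choosing:

```python
def make_batches(videos, batch_size, chunk_len):
    """Concatenate all videos into one long clip, pad the tail with frames
    from the first video, reshape into `batch_size` rows, and yield
    temporal chunks of `chunk_len` frames per row."""
    clip = [f for v in videos for f in v]
    width = -(-len(clip) // batch_size)       # ceil division: rectangle width
    pad = width * batch_size - len(clip)
    clip += videos[0][:pad]                   # fill remainder from first video
    # row b holds the b-th contiguous stretch of the long clip
    table = [clip[b * width:(b + 1) * width] for b in range(batch_size)]
    for t in range(0, width, chunk_len):
        yield [row[t:t + chunk_len] for row in table]

videos = [["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2"]]
chunks = list(make_batches(videos, batch_size=2, chunk_len=2))
assert chunks[0] == [["a0", "a1"], ["b1", "c0"]]
```

Note how each row of the rectangle preserves temporal continuity, so the recurrent state carries over between consecutive chunks of the same row.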
Our initial intention was to pre-train the model unsupervisedly in MatchNet configuration — so that it learns the dynamics of the videos in our data set — and then use it as a TempoNet with minimal effort and supervision. As we will see in sections 5.1 and 5.2, these two schemes seem to be mutually exclusive when operating in pixel space. Similar findings are reported by Neverova et al. (2017), in contrast with the findings of Lotter et al. (2016).
Here we define four loss functions, which allow us to explore the model's affinity for learning video features. The future-matching $\mu$ and replica-checking $\rho$ loss functions are both defined as a mean squared error:

$\mu(a, b) = \rho(a, b) = \mathrm{MSE}(a, b) = \frac{1}{n(a)} \sum (a - b)^2 \quad (1)$

where $a$ and $b$ are two same-sized tensors (or tensorial batches), $n(\cdot)$ is the element-count operator, and the summation is performed across every dimension. The temporal-stabilisation $\tau$ and periodic-classification $\pi$ loss functions are defined as a class-weighted cross entropy:

$\tau(x, c) = \pi(x, c) = \mathrm{CE}(x, c) = -w_c \log\!\big(\mathrm{softmax}(x)_c\big) \quad (2)$

where $x$ represents our logits (the spatial average pooling of the top discriminative block's output, named embedding, which undergoes a final linear transformation to produce the required number of output dimensions) shown in fig. 1, $c$ is the correct class label index associated with the current video frame, and $w$ are the class-balancing weights. We average along the batch dimension, if batches are used. The system's training loss is defined as a linear combination of the matching $\mu$, temporal $\tau$, and periodic $\pi$ losses; more precisely:

$\mathcal{L} = \alpha\,\mu + \beta\,\tau + \gamma\,\pi \quad (3)$

while we use the replica loss $\rho$ to monitor the training health. We use the Greek letters as mnemonics for the respective losses, i.e. $\mu$-matching, $\rho$-replica, $\tau$-temporal, and $\pi$-periodic.
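A minimal numeric sketch of these losses, using plain Python in place of tensor operations; the coefficient keyword names `a_mu`, `a_tau`, `a_pi` are placeholders of ours for the weights in eq. 3:

```python
from math import exp, log

def mse(a, b):
    """Matching (mu) and replica (rho) losses: mean squared error over all
    elements of two same-sized tensors (flattened here for simplicity)."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def weighted_ce(logits, c, w):
    """Temporal (tau) and periodic (pi) losses: class-weighted cross
    entropy between the logits and the correct class index c."""
    log_z = log(sum(exp(x) for x in logits))  # log partition function
    return w[c] * (log_z - logits[c])         # -w_c * log softmax(x)_c

def total_loss(mu, tau, pi, a_mu=1.0, a_tau=1.0, a_pi=1.0):
    """Training loss of eq. 3: a linear combination of the three losses."""
    return a_mu * mu + a_tau * tau + a_pi * pi
```

For instance, `mse([1.0, 2.0], [1.0, 4.0])` gives 2.0, and with uniform logits the weighted cross entropy reduces to the log of the class count.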
4.1 MatchNet mode
In MatchNet mode, we train the model unsupervisedly to minimise the matching loss $\mu$, i.e. to match the appearance of the next frame within the same video clip. Prediction of the second frame of a generic video, and of the frame that follows a video's last one (or the first frame of the next video), is disabled (see fig. 2), since these predictions would be erroneous due to state reset or missing data. Our expectation is that the ability to predict the future scene would build an unsupervised internal representation of the video dynamics, which we could later exploit for other tasks. We then use the replica loss $\rho$ — representing the similarity between the generative output and the current input frame — to monitor the health of the training procedure and check whether we are effectively matching the next frame or merely replicating the current one.
We also employ the temporal loss $\tau$ every time a video reaches its last frame (see blue triangle in fig. 2), computed on the last model embedding for that video, with the video's index as target (here we use an implicit index conversion, from the number of frames of a video to the location where it reaches its last frame in the batched input data); for example, if our data set has $V$ videos, the index ranges over $1, \dots, V$. We use the periodic loss $\pi$ only to monitor the prediction loss (its coefficient in eq. 3 set to zero), periodically, at every temporal chunk for all videos in the current batch (see yellow triangles in fig. 2). While we use $\tau$ to train the model via back-propagation-through-time (BPTT) and learn the video dynamics, we utilise $\pi$ in a static manner, which means that the gradient is not sent backwards to earlier time steps (notice the pink right-pointing arrow, instead of a double-headed one, in correspondence of $\pi$ in fig. 2).
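The per-time-step scheduling described above can be sketched as follows; the exact indexing conventions (e.g. that the matching loss is skipped when the current frame opens a video) are our assumption:

```python
def loss_schedule(video_ids, chunk_len):
    """Return, per time step, the set of losses that fire in MatchNet mode:
    - 'mu' needs frames t and t+1 from the same video, and frame t must not
      be the first of its video (second-frame prediction is disabled);
    - 'tau' fires when frame t is the last frame of its video;
    - 'pi' fires (for monitoring only) at the end of every temporal chunk."""
    T = len(video_ids)
    schedule = []
    for t in range(T):
        fires = set()
        same_next = t + 1 < T and video_ids[t + 1] == video_ids[t]
        first_of_video = t == 0 or video_ids[t - 1] != video_ids[t]
        if same_next and not first_of_video:
            fires.add("mu")
        if not same_next:                  # last frame of this video
            fires.add("tau")
        if (t + 1) % chunk_len == 0:       # chunk boundary
            fires.add("pi")
        schedule.append(fires)
    return schedule

sched = loss_schedule(["a", "a", "a", "b", "b"], chunk_len=2)
assert sched[0] == set()                   # first frame of a: mu disabled
assert sched[1] == {"mu", "pi"}            # mid-video + chunk boundary
assert sched[2] == {"tau"}                 # last frame of video a
```

This makes explicit that only $\tau$ and $\mu$ drive BPTT, while $\pi$ merely samples the classification quality at chunk boundaries.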
In order to validate our model's performance, we split every video into a training and a validation portion; given the clips' frame rate, this means that we use the last two seconds of each video as validation data.
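A minimal sketch of this split, assuming a hypothetical 30 fps clip of 100 frames:

```python
def split_video(frames, fps):
    """Hold out the last two seconds of each clip as validation data."""
    cut = max(0, len(frames) - 2 * fps)
    return frames[:cut], frames[cut:]

train, val = split_video(list(range(100)), fps=30)
assert len(val) == 60 and len(train) == 40
```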
4.2 TempoNet mode
In TempoNet mode, we train our model with weak supervision over object-class recognition through the periodic loss $\pi$, where the target represents the object class of each video in the current batch. We compute each component of the class-balancing weights $w$ inversely proportional to the number of samples of the corresponding class. Moreover, we use the matching loss $\mu$ only to monitor the training status (its coefficient in eq. 3 set to zero).
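A sketch of inverse-frequency class balancing; the normalisation below (weights averaging to one) is our assumption, as the paper's exact formula is not reproduced here:

```python
def class_weights(counts):
    """Class-balancing weights, inversely proportional to the per-class
    sample counts; normalised so the weights average to 1 (an assumption,
    not stated in the paper)."""
    inv = [1.0 / n for n in counts]
    scale = len(counts) / sum(inv)
    return [w * scale for w in inv]

w = class_weights([10, 10, 5])
assert abs(sum(w) - 3.0) < 1e-9   # weights average to 1
assert w[2] == 2 * w[0]           # rarer class weighted twice as much
```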
To avoid the situation where all the targets across the batch are held to a constant value, we implement the following data-feeding strategy. Each video is split into several temporally subsampled videos; all but one of the splits form the training set, and validation gets the remaining one. In this way the sampling rate drops by the subsampling factor, and the average training video length shrinks accordingly. We therefore observe, on average, several more video changes per temporal chunk than under the previous training scheme, whose lower rate would have caused the network to converge to an interesting unstable equilibrium point, given that our model is a dynamic non-linear system.
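The subsampling split can be sketched with stride slicing; the number of splits and the held-out choice below are illustrative:

```python
def subsample_splits(frames, k):
    """Split one video into k temporally subsampled videos: split i takes
    frames i, i+k, i+2k, ...  The sampling rate drops by a factor of k and
    the clip length shrinks accordingly."""
    return [frames[i::k] for i in range(k)]

splits = subsample_splits(list(range(10)), k=5)
train, val = splits[:-1], splits[-1]   # hold out one split for validation
assert splits[0] == [0, 5]
assert val == [4, 9] and len(train) == 4
```

Because each subsampled clip is k times shorter, the batched feeding scheme crosses video boundaries more often, which is exactly the effect sought above.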
5 Experiments and results
In this section, we present the major results and the corresponding experimental settings that showcase the performance of the CortexNet architecture family. For our experiments we used the e-Lab Video Data Set (e-VDS35) (Culurciello and Canziani, 2017), a growing collection of currently 970 clips of roughly 10 seconds each, capturing one of 35 common everyday objects from different perspectives. Duration outliers have been removed by extracting a two-sided confidence interval of a Student's t-distribution fitted to the population of video lengths; clips shorter than the lower bound are removed, and those longer than the upper bound are trimmed. The project source code can be found at (Canziani and Culurciello, 2017).
5.1 Unsupervised learning
In MatchNet mode, we train our model solely on unlabelled data and exploit the inherent data statistics as learning signals. More precisely, as explained in section 4.1, we feed the network batches of consecutive frames and utilise the matching loss $\mu$ to predict the next frame and the temporal loss $\tau$ to identify which video we have just finished processing, with no reference to the object class it belongs to. In this way, we leverage only the intrinsic characteristics of our data, and supervision is reduced to its minimum. We trained a four-layer CortexNet with 970-dimensional output logits on top of the last discriminative block, using momentum, weight decay, and an initial learning rate decayed by a constant factor every 10 epochs, for a total of 30 epochs, with the loss coefficients in eq. 3 set accordingly. We obtained a matching loss $\mu$ less than half the replica loss $\rho$, which means we are more than twice as good at predicting the next frame as at merely copying the input one. In fig. 3 we can see how the model behaves on an interesting video from the validation set. The purple line in fig. 3 represents the panning speed: the higher its value, the larger the amount of motion recorded. The video clip starts with non-zero panning, which slows down over 12 frames; the camera auto-focusses around frame 20; panning increases to its maximum around frame 44 and then slows down again. The model constantly tracks the different moving elements, successfully predicting the future frame whenever the input is not corrupted (blurred by the auto-focus).
Surprisingly, as soon as the network learns to accurately predict the next frame in our training clips, the output logits become constant. This indicates that the task of generating the future input appearance is effectively carried out by a few of the lower block pairs. Therefore, we investigate whether we can obtain a more useful high-level representation with an alternative training strategy.
5.2 Supervised learning
In TempoNet mode, we exclusively train our model from the top representation through the periodic loss $\pi$, over the object classes, with BPTT (note that MatchNet used video indices and static back-prop for classification). Although we experimented with several values for the temporal-loss coefficient in eq. 3, it showed no overall effect on the classification task, and performance remained comparable across settings, probably due to the amount of motion introduced by the subsampling.
We pre-trained the discriminative branch of a six-layer CortexNet, with 33-dimensional output logits on top of the last discriminative block, on an image data set of 300k images of objects similar to those in e-VDS35 (Culurciello and Canziani, 2017); this data set is a subset of Open Images (Krasin et al., 2016). For the pre-training, we used momentum, weight decay, and an initial learning rate reduced on a fixed epoch schedule, and trained the feed-forward branch for several epochs. The performance of the discriminator branch on a video clip is shown in fig. 4.
Then we added the generative branch, swapped the classifier for a 35-dimensional one, and fine-tuned the whole model on e-VDS35 with the corresponding loss coefficients in eq. 3, again using momentum, weight decay, and a decaying learning rate. The resulting model is much more robust to temporal perturbations of the input video stream (see fig. 4), adopting a selective attention mechanism to focus on, and track over time, the main object present in the scene. Additionally, we use salient-object-finding algorithms (Bojarski et al., 2017; Canziani and Culurciello, 2015) to dynamically visualise the locations where the network is currently paying attention (see snapshot in fig. 4).
6 Conclusions
In this paper, we introduce a new neural network family, called CortexNet, which not only models the bottom-up feed-forward connections in the human visual system but also employs delayed modulatory feedback and lateral connections, in order to learn end-to-end a more robust representation of natural temporal visual inputs. We explore an unsupervised and a weakly supervised strategy to train two models on a custom, object-centric video data set. We report performance in terms of prediction mean squared error, comparing it to the trivial input-matching task, and we also show how the new architecture provides a much more stable prediction output on a testing video clip. Lastly, we observe that the task of predicting a future frame directly in pixel space is not complementary to that of predicting low-frequency labels, such as video indices, action classes, or anything that spans several tens of frames in time.
This project leveraged the power, speed, and quick implementation time of PyTorch for all computationally expensive operations. It resorted to the illustrating capabilities of the Inkscape vector graphics software, and explored and visualised data through the matplotlib library combined with the Jupyter Notebook interactive computational environment. This work was partly sponsored by the Office of Naval Research grants N00014-15-1-2791 and N00014-17-1-2225. We also thank NVIDIA for the donation of graphical processors.
- American Optometric Association. Infant vision: birth to 24 months of age.
- Bojarski et al. (2017) Mariusz Bojarski, Philip Yeres, Anna Choromanska, Krzysztof Choromanski, Bernhard Firner, Lawrence Jackel, and Urs Muller. Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv preprint arXiv:1704.07911, 2017.
- Canziani and Culurciello (2015) Alfredo Canziani and Eugenio Culurciello. Visual attention with deep neural networks. In Information Sciences and Systems (CISS), 2015 49th Annual Conference on, pages 1–3. IEEE, 2015.
- Canziani and Culurciello (2017) Alfredo Canziani and Eugenio Culurciello. CortexNet project website. 2017.
- Canziani et al. (2016) Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
- Chalasani and Principe (2013) Rakesh Chalasani and Jose C Principe. Deep predictive coding networks. arXiv preprint arXiv:1301.3541, 2013.
- Culurciello and Canziani (2017) Eugenio Culurciello and Alfredo Canziani. e-Lab video data set. 2017.
- Friston (2008) Karl Friston. Hierarchical models in the brain. PLoS Comput Biol, 4(11):e1000211, 2008.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
- Krasin et al. (2016) Ivan Krasin, Tom Duerig, Neil Alldrin, Andreas Veit, Sami Abu-El-Haija, Serge Belongie, David Cai, Zheyun Feng, Vittorio Ferrari, Victor Gomes, Abhinav Gupta, Dhyanesh Narayanan, Chen Sun, Gal Chechik, and Kevin Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. https://github.com/openimages, 2016.
- Lamme et al. (1998) Victor AF Lamme, Hans Super, and Henk Spekreijse. Feedforward, horizontal, and feedback processing in the visual cortex. Current opinion in neurobiology, 8(4):529–535, 1998.
- Lotter et al. (2016) William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
- Neverova et al. (2017) Natalia Neverova, Pauline Luc, Camille Couprie, Jakob Verbeek, and Yann LeCun. Predicting deeper into the future of semantic segmentation. arXiv preprint arXiv:1703.07684, 2017.
- Nguyen et al. (2015) Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
- Pathak et al. (2016) Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. arXiv preprint arXiv:1612.06370, 2016.
- Pătrăucean et al. (2015) Viorica Pătrăucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309, 2015.
- Rao and Ballard (1997) Rajesh PN Rao and Dana H Ballard. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9(4):721–763, 1997.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- Softky (1996) William R Softky. Unsupervised pixel-prediction. Advances in Neural Information Processing Systems, pages 809–815, 1996.
- Srivastava et al. (2015) Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843–852, 2015.
- Vondrick et al. (2016) Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 98–106, 2016.
- Wang and Gupta (2015) Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.