In the recent years the fusion of bio-inspired and neuromorphic vision models with machine learning has dominated the development of artificial vision systems for the categorization of multiple objects in static frames. Bio-inspired deep networks are computer-vision and computational-neuroscience models of the mammalian visual system implemented in deep neural networks[1, 2, 3, 4, 5, 6]. Most deep network architectures are composed of multiple layers (2, 3 typically), where each layer is composed of: linear two-dimensional filtering, non-linearity, pooling of data, output data normalization [7, 8, 9]. Recent machine learning research has focused on the task of training such deep networks from the abundant digital data available in the form of image frames and videos. In particular, deep networks need to learn good feature representations for complex visual tasks such as object categorization and tracking of objects in space and time, identifying object presence and absence. These representations usually involve learning the linear filter weight values from labeled and unlabeled input data. Since labeled data is costly and often ridden with human errors [10, 11, 12], the recent focus is on learning these features purely from unlabeled input data [13, 14, 15, 16, 17]. These recent methods typically learn multiple layers of deep networks by training several layers of features, one layer at a time, with varying complexity of learning models.
Recent techniques based on unsupervised clustering algorithms are especially promising because they use simple learning methods that quickly converge . These algorithms are easy to setup and train and are especially suited for robotics research, because less complex knowledge of machine learning is needed, environment-specific data can be collected quickly with a few minutes of video, setup of custom size deep networks is quick and can be adapted to specific tasks. In addition, real-time operation with efficient networks can be obtained with only a few minutes of training and setup, leading to quick and direct experimentation in robotic experiments.
In this paper we present results obtained with unsupervised clustering algorithms on the training and operation of deep neural networks for real-time robotic vision systems. We provide simple techniques and open-source software that allows robotic researchers to use deep network in a short setup time and with little or no knowledge of machine learning necessary. The main goal of the paper is not to present state-of-art results on a specific dataset. Rather we use standard published datasets to evaluate the performance of prototype robotic vision system for general-purpose use, where no dataset is available. It is thus not useful to train the network to perform only on one dataset, when the levels of performance would not carry over to another dataset or real-world images. The goal is thus mainly to evaluate the use of unsupervised networks that can support at least ten frames-per-second operation on commercial hardware, such as recent laptop computers.
The paper presents the following key innovations: (1) use of clustering learning for quickly training robotic systems (Section 2), (2) the use of distance-based filtering, as opposed to the standard convolution (Section 2.2), (3) the experimental proof that clustering learning networks can outperform supervised networks on general purpose robotic tracking tasks (Section 4).
In this paper we created and tested a model of unsupervised clustering algorithms (CL) that can quickly learn linear filters’ weight values, and also is amenable to real-time operation with conventional mobile hardware. We used the Torch7 software for all our experiments , since this software can reduce training and learning of deep networks by 5-10 times compared to similar Matlab and Python tools.
2.1 Input data
Input image data was obtained from the CIFAR10  and the Street View House Numbers (SVHN)  datasets. The SVHN dataset has a training size of 73,257 32x32 images and test size of 26,032 32x32 images. The CIFAR10 dataset has a training size of 20,000 32x32 images and a test size of 2,000 32x32 images. Both datasets offer a 10 categories classification task on 32x32 size images. The train dataset was in all cases learned to 100% accuracy, and training was then stopped. Input data was contrast normalized separately on each RGB channel with a 9x9 gaussian filter using the Torch7 ”nn.Spatial Contrastive Normalization” function.
For the real-time networks in Section 3, we used patches from a contrast-normalized version of a few images from the Berkeley image dataset . As testing data for a general-purposed robotic vision system, we decided to test our networks in a tracking task, where the network has to be able to track an object of interest based on a single presentation. For this purpose we use the challenging benchmark TLD dataset . From this dataset we selected multiple videos with different properties of occlusions, camera movement, pose, scale and illumination changes.
Even if other groups showed slight improvements using the YUV color space , we were not able to reproduce the benefits of YUV, so we kept the images in their original RGB space.
Also we did not use whitening of data (such as ZCA whitening) even if other groups have shown clear advantages of using it. We did not use whitening because of two main reasons: first, it is not applicable for general-purpose vision system where an a-priori dataset cannot be obtained. Second, whitening computation is very expensive (similar to the first layer of a convolutional neural network) and we instead replaced it with local contrast normalization, which is a bio-inspired technique to whiten the input data removing its mean bias and adapting the input dynamic range.
2.2 Network architecture
The deep neural network architecture is composed of 4 layers, not counting pooling and normalization operations. Two layers of linear two-dimensional filtering and two layers of output classifier in the form of a fully connected 2-layer neural network. The first two layers were composed of a two-dimensional convolutional linear filtering stage, a L2 norm pooling stage, and a subtractive normalization layer for removing the mean of all outputs. The filters of the first two layers are generated with unsupervised clustering algorithms, as explained below. Training of the last two fully-connected neural network layers was performed with approximately 50-100 epochs on the SVHN dataset on a quad-core Intel i7 laptop, or about 8 hours. Test data maximum precision usually only needed approximately 15 epochs.
The layers in the clustering learning network used the following sequence of operations. Using the naming convention in , for each layer is an input feature map, is an output feature map. The input of each layer is a 3D array with 2D feature maps of size
. Each component (pixel, neuron) is denoted. The output is also a 3D array, composed of feature maps of size .
SpatialSAD module: performing sum-abs-diff operation on images convolutionally with the learned CL filters: , where is the absolute value of .
Spatial Contrastive Normalization: to zero the data mean and standard deviation
L2 pooling over 2x2 regions: , computing the weighted average of the input in an (here ) region with a gaussian kernel .
Spatial Subtractive or Contrastive Normalization: to zero the data mean, reset std to unity. The subtractive normalization operation for a given site computes: , where is a normalized truncated Gaussian weighting window (typically of size 9x9). The divisive normalization computes where . The local contrast normalization layer is inspired by visual neuroscience models.
In order to show the effectiveness of the learning techniques, we compared them to a standard 1-layer and a 2-layers convolutional neural network (CNN) [24, 9, 8], followed by the same 2-layers classifier. The layers in the convolutional neural network used the following sequence of operations:
SpatialConvolution module: performing convolutions on images with the learned CL filters: , where is the 2D discrete convolution operator and is a trainable bias parameter.
L2 pooling over 2x2 regions: as described above.
Tanh nonlinearity: as described above.
Spatial Subtractive Normalization: as described above.
All networks used 16 filters on the first layer, 128 filters on the second layer. The final classifier was fixed to 128 hidden units and 10 output classes for CIFAR and SVHN. Clustering learning networks used a fully connected (all channels connected to all channels) input to 1st, and 1st to 2nd layer, while convolutional neural networks used 1-out-of-3 and 4-out-of-16 random connection table between the input and 1st layer, and 1st and 2nd layer respectively.
Notice that CL networks used sum-abs-diff metrics to correlate filters responses to inputs. This is different to the standard approach of deep networks [1, 15, 19, 7] where convolution operations are used. Our choice of sum-abs-diff operations was dictated by improved performance of the CL filters with respect to convolutions. In a standard modern computer the difference between convolutions (multiplications by weights) and distance metrics (differences, subtractions) is not visible, as multipliers are optimized to perform as fast as the simpler difference operations. On the other hand, when programmable hardware as FPGA is used, the use of differences instead of multipliers can reduce the silicon area utilization, and power up to 15 times on a 16 bit operation, and more for larger number of bits. A full comparison of the hardware advantages of using distance operators instead of convolutions will be the subject of future publications.
We use k-means clustering algorithm to learn a set of 16 filters in the first layer, and 128 filters in the second layer. The techniques and scripts are general and can quickly modified to learn any number of filters. The filter sizes on both layers was set to 5 x 5 pixels for the SVHN datasets. In CIFAR 1st layer we used 5 x 5 filters, and on the 2nd layer, we used 3 x 3 filters, as features were more spatially constrained in this dataset. Clustering used the same size patches of the normalized images, and we used 1 M patches from each dataset to train the first layer. The second layer training was performed by passing the entire dataset through the first layer of the deep neural network. The output dataset was then used again with the same script to train another set of linear filters, by using 1 M patches of the processed dataset.
Clustering learning filters learned on the 2nd layer used as many planes as the output of the 1st layer (16 here). For example we learned 128 filters each with 16 dimensions of 5 x 5 pixels. This was done to cluster different sets of features for each output of the 1st layer, and increased performance by an average of 5% or more.
Examples of the filters learned with CL techniques on a 1st and 2nd layer are given in figure 2. Both layer filters were obtained with training for 10 minutes on a modern quad-core Intel i7 laptop. A supervised network of this size would require several tens of thousands of image examples and several hours of training time.
3 Real-time network
The goal of this paper was to provide a simple and fast method to train unsupervised networks for general-purposed robotic vision system. For real-time experiments we used the TLD dataset . We then compared the tracking performance of a real-time deep network both trained supervised , and with CL techniques described here.
The CL network for this task had the same two-layer network architecture used in 
. This insured real-time operation of 6 frames/s on a quad-code Intel i7 laptop computer. We focused on this network and restricted ourselves to real-time operation because the goal of this project is the use of deep networks in mobile computers. The network operates on 46 x 46 input images, uses 16 filters with 7 x 7 receptive field on the first layer and 128 filters with 7 x 7 receptive field on the second layer. The 1st to 2nd layer fan-in was 8, and the 2nd to 3rd layer fan-in was 64. Both layers were connected with random tables. The network produces a 128 feature vector as output.
We trained the same size and number of filters through clustering algorithm for use in this network. We used patches from a contrast-normalized version of a few images from the Berkeley image dataset . Any set of natural-scene images can be used to train the general-purpose network presented in this section, and we on purpose chose not to sample patches from the target TLD dataset, in order to demonstrate the learning invariance properties of our technique.
4.1 Static datasets
We report the results in the SVHN dataset in figure 3. Here we compared results of accuracy in the test set for 4 cases: clustering learning with 1 layer (CL 1 layer), clustering learning with 2 layers (CL 2 layers), a 1-layer and a 2-layers convolutional neural network (CNN 1l, 2l).
The data shows that clustering learning and CNN with 1 layer provide remarkably the same levels of accuracy on the dataset. This shows that at least on the 1st layer, the features learned with clustering learning are very effective. Adding a second layer brings the convnet to 91% levels of accuracy, which are standard without using any sophisticated tricks and with a small network with only 16 filters on the first layer. On the other had, the clustering learning network with 2 layers showed more than 1% increase in accuracy, from 88% to 89%. This increase is not as large as one would want and expect from adding a second layer, but is consistent with unsupervised learning results [17, 26].
It is interesting to note that with the clustering learning 2 layers network accuracy was above 88%, and plateaued with the train set accuracy plateauing also at 92%. This shows that clustering learning filters also do not over fit, and present non-perfect, but almost identical results on both train and test sets. We note that CNN network trained supervised on SVHN report state-of-the-art performances of 96% and above.
The results above were all obtained with feed-forward hierarchical networks. We also tried to use multiple layers of clustering learning unsupervised networks in parallel, as recommended by other publications [17, 8], but we did not obtain any benefits from that strategy, on the contrary parallel networks always reported losses of 3-5% accuracy with respect to a single layer. This is different from what reported in [17, 26].
We also report here the results in the CIFAR10 dataset in figure 4. As in the SVHN case, we compared results of accuracy in the test set for 4 cases: clustering learning with 1 layer (CL 1 layer), clustering learning with 2 layers (CL 2 layers), a 1-layer and a 2-layers convolutional neural network (convnet 1l, 2l).
The results in the CIFAR10 dataset show a gain of more than 10% from using a single layer convnet to a 2 layer convnet. Clustering learning showed the same behavior as in the SVHN dataset: adding a second layer achieved 3-5% better accuracy on the test set. This dataset uses very small images and is notoriously difficult, and clustering learning only reported a 50% accuracy overall with 2 layers. All results are much lower than the current state of art in this dataset, which is close to 90% . But again we stress that our goal is real-time implementation and other researcher have obtained record results in this dataset only with very large networks, most of which are not amenable to real-time operation.
We tested random filters used on the first 2 layers of the networks for both SVHN and CIFAR10 datasets. The 1st layer of the network was trained with CL, as mentioned above. The randomly connected network used a fixed CNN layer as described in section 2.2. Same qualitative results were obtained with a random CL model. The SpatialConvolution module was initialized with uniformly random filters by the Torch7 tool, and left untouched. We have found that random filters on the 2nd layer of this networks perform with the same levels of precision as the 2nd layers trained with CL. This was true for both SVNH and CIFAR10 datasets.
We also used the filters obtained with CL on a 2-layers standard CNN network. But the use of convolution operation was giving results that were 5-10% less precise that the results obtained with sum-abs-diff operators.
4.2 Dynamic datasets
The results for the comparison for the real-time tracking network on the TLD dataset  are given in Table 1. Each video contains only one target. The metric used is the number of correctly tracked frames. More information and images on this dataset can be found here . All network we tested were real-time networks performing at or above 5 frames/s. As can be seen in Table 1, the Clustering Learning (CL) network performs better than the convolutional neural network from  in all sequences, and is comparable only in one sequence (”Pedestrian 2”). These results are currently not state-of-the-art, which is currently obtained in references [30, 22].
Not only the perfomance of the CL network is higher than CNN on the TLD dataset, but also the CNN from  was trained in a week time, while the CL network was tried in 10 minutes. There is a clear advantage to using CL networks for general-purpose, dataset-free robotic vision tasks.
|Sequence||Frames||Precision: CL||Precision: CNN |
In addition, we present in table 2 a comparison of running time of the most recent deep network work performing at the state-of-the-art in a variety of static datasets. We provide this comparison to show that although these networks perform extremely well on a specific dataset, we point out that similarly to the experimental results in 1, that performance might not carry over to robotic vision tasks such as tracking of previously unseen objects. We also want to point out that the state-of-the-art networks in table 2 are far from real-time operation, and it is not clear in their publication how that can be achieved down-sizing the networks or what performance they might attain.
|Publication||Frames/s||Precision||1st layer filters||Whitening|
|CL (this work)||5-10||89%, SVHN||16||none|
The data provided in table 2 reports published and informally obtained data on the best results obtained in static datasets. Cl networks also reported 50% precision on CIFAR10, a low value compared to [26, 27]. As reference 
precision we used the RGB only data. Most published results did not report computer time. We asked the author to provide us the data for this table, when available. We estimated other network compute time based on the size of the 1st layer of their network, which is usually the most computationally demanding. We estimated frames/s based on the speed of our networks. We want to stress the importance of reporting computing time for deep networks, and the use of whitening, as it is the only way to compare large and small networks and try to find the optimal size for robotic vision system, or other tasks.
We presented results on clustering learning algorithms for general-purpose vision system. These algorithms can be used to train multi-layer feed-forward neural networks in minutes, with fully automated scripts with a reduced set of learning parameters. We show results on static datasets and dynamic tracking of objects in videos. We show results that prove that CL techniques are a viable and quick option to training deep networks. In static datasets, although they do not perform at the state-of-the-art because of their small network size, they can however be run in real-time because of the small network size. The accuracy gap in recognition can be reduced by using more features, as reported in, with an evaluation time linearly proportionalto the number of filters used. We also show that on tracking datasets the CL networks outperforms CNN networks. We believe the tracking dataset is a better approximation of robotic tasks, where locking on a target object is required for approach and manipulation.
We show that sum-abs-diff operators can be more effective than convolution operations for filtering in deep networks. We also point out that custom hardware with these operators will be more efficient in power and space.
Many groups in the field of deep learning work on static datasets to demonstrate the best learning techniques. A lot of these techniques cannot be applied in robotic vision system with the current modern hardware because they cannot perform in real-time ad require too much computational load.
We argue that more research is needed in the field of applied and real-time deep networks, where shortcuts need to be taken in order to optimize performance for speedy operation. The use of custom hardware is recommended  but not necessary in many applications.
CL models are also very interesting to bio-inspired vision research because they provide a close connection between computational neuroscience and machine learning. In particular unsupervised clustering algorithms provide a simplistic model of Hebbian Learning methods, where neurons that respond to the same input are clustered [34, 35, 36, 37] . Most of previous work was mathematically complex and not efficient to implement. In this paper we thus present one of the first practical application of Hebbian-like learning applied to deep networks. Its fast and simple computation can help researchers quickly train complex neural networks.
Further research is also needed to justify unsupervised learning on the 2nd or later layers. Our experiments indicate that random filters are as good as learned CL filters. This fact was also shown in other papers [26, 31] where random filters were used in the 2nd layer, after one CL-trained layer with a large number of features (128 or above).
All the code for this paper is available here: https://github.com/culurciello/CL_paper1_code.
We are especially grateful to the the Torch7 developing team, in particular Ronan Collobert, who first developed this great, easy and efficient tool, Clement Farabet, Koray Kavukcuoglu, Leon Bottou. We could not have done any of this work without standing on these giants shoulders.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
R. Hadsell, S. Chopra, and Y. LeCun.
Dimensionality reduction by learning an invariant mapping.
Proc. Computer Vision and Pattern Recognition Conference (CVPR’06). IEEE Press, 2006.
-  K. Gregor, A. Szlam, and Y. LeCun. Structured sparse coding via lateral inhibition. In Advances in Neural Information Processing Systems (NIPS 2011), volume 24, 2011.
-  M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 2:1019–1025, 1999.
-  T. Serre, A. Oliva, and T. Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, 104(15):6424–6429, 2007.
-  T. Serre and T. Poggio. A neuromorphic approach to computer vision. Communications of the ACM, 53(10):54–61, 2010.
-  K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09). IEEE, 2009.
-  Y. LeCun, K. Kavukvuoglu, and C. Farabet. Convolutional networks and applications in vision. In Proc. International Symposium on Circuits and Systems (ISCAS’10). IEEE, 2010.
-  Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in vision algorithms. In Proc. International Conference on Machine learning (ICML’10), 2010.
-  A. Karpathy. Lessons learned from manually classifying CIFAR-10. http://karpathy.ca/myblog/2011/04/27/lessons-learned-from-manually-classifying-cifar-10-with-code/, April 2011.
-  A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, page 1521–1528, 2011.
-  X. Hou, A. Yuille, and H. Koch. A meta-theory of boundary detection benchmarks. In NIPS Workshop on Human Computation for Science and Computational Sustainability, 2012.
-  B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
-  A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4):411–430, 2000.
-  G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol.
Extracting and composing robust features with denoising autoencoders.In Proceedings of the 25th international conference on Machine learning, page 1096–1103, 2008.
-  A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. AISTATS, 14, 2011.
-  R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In NIPS Workshop BigLearn, 2011.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, 2011.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, page 416–423, 2001.
-  Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(7):1409–1422, 2012.
-  B. A. Wandell. Foundations of vision. Sinauer Associates, 1995.
-  P. Sermanet, S. Chintala, and Yann LeCun. Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition (ICPR 2012), 2012.
-  C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers. arXiv preprint arXiv:1202.2160, 2012.
-  A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. Advances in Neural Information Processing Systems, 24:2528–2536, 2011.
-  D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, page 3642–3649, 2012.
-  Z. Kalal, J. Matas, and K. Mikolajczyk. P-n learning: Bootstrapping binary classifiers by structural constraints. Conference on Computer Vision and Pattern Recognition, 2010.
-  Z. Kalal. TLD dataset. http://info.ee.surrey.ac.uk/Personal/Z.Kalal/TLD/TLD_dataset.pdf, 2010.
-  A. Dundar, J. Jin, and E. Culurciello. Visual tracking with similarity matching ratio. arXiv preprint arXiv:1209.2696, 2012.
-  R. Socher, B. Huval, B. Bhat, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3D object classification. Advances in Neural Information Processing Systems, 2012.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
-  C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun. Neuflow: A runtime reconfigurable dataflow processor for vision. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, page 109–116, 2011.
-  T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural networks, 2(6):459–473, 1989.
-  P. Földiák. Forming sparse representations by local anti-hebbian learning. Biological cybernetics, 64(2):165–170, 1990.
-  E. Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927–935, 1992.
-  A. J. Bell and T. J. Sejnowski. The “independent components” of natural scenes are edge filters. Vision research, 37(23):3327, 1997.