Learning features invariant to arbitrary transformations in data is a requirement for any biological or artificial recognition system. In recent years, there has been a surge of interest in learning feature hierarchies from data using multilayered or deep learning models, largely motivated by the layered organization of the neocortex. It is now widely accepted that the simple and complex layers in the primary visual cortex (or V1) are responsible for learning transformation-invariant features. A number of computational models have been proposed that can learn transformation-invariant features for state-of-the-art recognition in images, audio and videos using alternating simple and complex layers, such as Neocognitron [3, 4] and HMAX [5, 6]. Given the biological relevance and technological usefulness of the simple-complex layers, it is imperative to understand their function as a canonical computational unit.
The contribution of this paper is a fully-learnable model with only two manually tunable parameters, the learning rate and the threshold decay parameter, for learning invariant features from spatiotemporal data, such that the model may be used as a canonical computational unit in deep neural networks for recognition and prediction. In the proposed model, the functions of the simple and complex layers have similar formulations in space and time, respectively, signifying their functional similarity.
In particular, we present a two-layered neural model (architecture and learning algorithm) that operates in a feedforward (or bottom-up), unsupervised and online manner. We show that:
1. Spatial features may be learned in the first layer by spatial clustering on the surface of a hypersphere of unit radius (a.k.a. spherical clustering), where the outliers are not allowed to influence the cluster centers. When learned from natural videos, these features resemble the receptive fields (RFs) of simple cells in V1. We will refer to the first layer of our model as the simple layer.
2. Arbitrary transformations of these features may be learned from time-varying data in the second layer by temporal spherical clustering, where the outliers are not allowed to influence the cluster centers. When learned from natural videos, the response properties of the second layer neurons resemble those of complex cells in V1. We will refer to the second layer as the complex layer.
3. Predictive capability may be induced in this model by learning transition probabilities in lateral connections among the simple layer neurons. Higher-order predictions may be made by using the transformations learned in the complex layer in conjunction with the lateral connections.
4. A topographic map of the spatial features emerges when the flow of activation between neurons in the same layer that fire in close temporal proximity decays exponentially with the distance between them. Unlike other models (e.g., [8, 9]) where the pooling regions are predefined or some sort of group sparsity is assumed to learn topographic maps from spatial data, we exploit the temporal continuity of data and physical constraints to learn a topographic feature map.
1.1 Receptive fields
In our model, the goal of simple and complex neurons may be conceptualized as learning subsets from a pool of neurons in space and time, respectively. A postsynaptic neuron integrates activations from presynaptic neurons over space and time. Each neuron has a spatial RF and a temporal RF, both of fixed sizes. The size of the feature it encodes may be less than or equal to its spatial and temporal RF sizes. All neurons in a layer have RFs of the same size. The spatial RF size of a simple neuron is defined by the number of neurons in the layer below that report to it at any time instant. The goal of a simple neuron is to get strongly connected (connections may be excitatory or inhibitory) to a subset of neurons within its spatial RF in order to encode the recurring spatially coincident patterns in the layer below. When this subset of presynaptic neurons fires, the postsynaptic simple neuron is highly likely to fire. The spatial RF size of a complex neuron is unity, i.e. at any time instant, a postsynaptic complex neuron can receive input from only one presynaptic simple neuron. This implicitly assumes a winner-take-all mechanism in the simple layer. All simple neurons in a layer report to all complex neurons in the layer above, albeit at different time instants.
In our model, a neuron samples the input stream at a fixed interval of time instants, and this interval is referred to as the temporal RF size of the neuron. Complex neurons sample the input at a lower frequency than the simple neurons in the layer below, i.e. their temporal RF size is larger. Conceptually, a neuron in the higher layer fails to distinguish the temporal sequence of events occurring within its temporal RF, and hence considers all of those events to occur simultaneously. However, a neuron in the lower layer can keep track of the temporal sequence of events occurring within that same interval due to its higher sampling frequency. We use this insight to model the feedforward weights to learn sets, and the lateral weights in conjunction with the feedforward weights to learn sequences.
The goal of a complex neuron is to get strongly connected to a subset of simple neurons in the layer below in order to encode the recurring temporally coincident patterns in that layer, where each pattern corresponds to an instance of an arbitrary transformation. The size of this subset is at most the temporal RF size of the complex neuron. When this subset of presynaptic simple neurons fires in close temporal proximity (i.e., within the temporal RF of the complex neuron), the postsynaptic complex neuron is highly likely to fire. The temporal RF size of a simple neuron is unity with respect to that of neurons in the lower layer, i.e. a postsynaptic simple neuron integrates activations from its lower layer over only one time instant. Thus, a simple neuron integrates activations from presynaptic neurons over space and fires if its threshold is crossed, while a complex neuron integrates activations from presynaptic simple neurons over time and fires if its threshold is crossed. Being able to learn sets from a pool of neurons in space and time is therefore the crux of the invariant feature learning problem.
1.2 Objective function
Formally, we define a set as a finite collection of distinct alphabets, where each alphabet is a fixed-dimensional feature or event, no two elements of the set are equal, and the number of elements is the cardinality of the set. We define a sequence over a set as a finite ordered list of alphabets from that set, in which an element with a smaller index occurs before an element with a larger index, elements may repeat, and the number of elements is the length of the sequence. Therefore, learning a subset of features from recurring coincidences in the data requires clustering the data into a set of clusters. Soft-clustering is a better option for natural data.
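The distinction between a set and a sequence can be illustrated with a minimal sketch. The single-letter labels and the helper names `as_set` and `as_sequence` are hypothetical and used for illustration only; in the model, each alphabet would be a learned feature.

```python
def as_set(events):
    """Collapse a window of events into a set: order and repeats are discarded."""
    return frozenset(events)


def as_sequence(events):
    """Keep the window as a sequence: order and repeats are preserved."""
    return tuple(events)


# Two windows containing the same alphabets in a different order, one with a repeat.
window_1 = ["a", "b", "c"]
window_2 = ["c", "a", "b", "a"]

same_set = as_set(window_1) == as_set(window_2)
same_sequence = as_sequence(window_1) == as_sequence(window_2)
```

Both windows map to the same set but to different sequences, which is the sense in which a unit that learns sets is insensitive to temporal order within its window.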
Formation of a cluster may be viewed as a pseudo-event that occurs where (in case of spatial clustering) or when (in case of temporal clustering) all or most of the events in the cluster occur. Let,
where denotes location in case of spatial clustering and time in case of temporal. Also,
Clustering may then be defined as an optimization problem that minimizes the following objective function:
where are the parameters of the model, and () are the observations. . The that minimizes
is a maximum a posteriori probability (MAP) estimate assuming a uniform prior. This formulation is similar to correlation clustering; it automatically recovers the underlying number of clusters.
In the next section, we present a two-layered neural network model where neurons in the lower layer are activated by spatial pseudo-events while those in the higher layer are activated by temporal pseudo-events. The feedforward weights are learned using the simplest form of the Hebbian rule to minimize the objective function in an unsupervised and online manner. In Section 3, we describe experimental results on natural spatiotemporal data, followed by conclusions.
2 Network model
Our network architecture consists of a hierarchy of layers of nodes (see Fig. 1). A node is a canonical computational unit consisting of two layers – simple neurons in the lower layer, and complex neurons in the higher layer (see Fig. 1). Neurons in a node are sparsely connected to neurons in the neighboring nodes in the same layer, one layer above and one layer below by lateral, feedforward and feedback connections respectively. The first (or lowest) layer in the hierarchy receives input from external data varying in space and time. In this paper, we will concentrate on learning invariant features in a node using the feedforward and lateral connections only.
We will refer to the layer that receives external inputs as the input layer, and to the two layers in a node as the simple layer and the complex layer. Each neuron in a layer is connected to all neurons in the layer above in a feedforward manner (see Fig. 1). Each feedforward weight is the strength of the connection from a neuron in one layer to a neuron in the layer above at a given time. Each neuron is also connected to all neurons in its own layer, except itself, by lateral connections (see Fig. 1). We will assume that the number of input-layer neurons reporting to the simple layer in a node is equal to the spatial RF size of a simple neuron, and that each simple neuron is connected to all complex neurons in a feedforward manner.
The goal of feedforward processing in the perceptual cortices is hypothesized to be rapid categorization. Our model samples the input stream at regular intervals of time. At each sampling instant, it accepts spatial data as input through the input layer, which is passed on to the simple and complex layers in the form of activations. The goal of computations in a node is to selectively cluster the input into groups. Over time, each simple neuron in a node gets tuned to a unique feature, which represents a spatial cluster center, while each complex neuron gets tuned to a unique transformation. Functionally, a node is a bag of invariant filters, all of which are applied to each patch of the input data.
A simple neuron integrates activations from presynaptic neurons in the input layer over its spatial RF and fires if the integrated input crosses its threshold. The activations of simple neurons at a given time are:
The activation combines a feedforward term, driven by the external input, and a lateral term, driven by the states of neurons in the same layer. Each feature and the input are normalized; hence the feedforward term is the normalized dot product of the input with each feature, modulated by the lateral interaction. This allows a simple neuron to act as a suspicious coincidence detector, responding with high activation if the input matches the feature encoded in its RF.
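A minimal sketch of the feedforward part of this computation follows, assuming that the activation of each simple neuron is the dot product between its unit-norm feature and the unit-norm input. The lateral modulation is omitted, and the function names and toy vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def feedforward_activations(features, x):
    """Normalized dot product of the input with each feature (cosine similarity)."""
    x_hat = unit(x)
    return [sum(w * xi for w, xi in zip(unit(f), x_hat)) for f in features]


# Three toy features in a two-dimensional input space (illustrative values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
acts = feedforward_activations(features, [2.0, 2.0])
best = acts.index(max(acts))
```

Because both vectors are normalized, the activation depends only on the direction of the input, so the third feature, which points in the same direction as the input, responds most strongly.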
A complex neuron integrates activations from presynaptic neurons in the simple layer over its temporal RF and fires if the integrated input crosses its threshold. The activations of complex neurons at a given time are:
Here, integration runs from the time instant at which the neurons start integrating up to the current time. Each feature is normalized to have unit norm. A complex neuron acts as a temporal coincidence detector.
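As a rough illustration of temporal integration, the sketch below assumes that a complex neuron sums, over the instants in its window, the feedforward weight from whichever simple neuron won at each instant. This form is an assumption made for illustration; the function name and the toy weights are hypothetical.

```python
def complex_activation(weights_to_complex, winners):
    """Sum the feedforward weight from each instant's winning simple neuron.

    `winners` holds one simple-neuron index per time instant in the window,
    or None for an instant at which no simple neuron fired.
    """
    return sum(weights_to_complex[w] for w in winners if w is not None)


# Unit-norm weights from four simple neurons to one complex neuron (toy values).
weights_to_complex = [0.6, 0.0, 0.8, 0.0]

in_order = complex_activation(weights_to_complex, [0, 2])
reversed_order = complex_activation(weights_to_complex, [2, 0])
unrelated = complex_activation(weights_to_complex, [1, None, 3])
```

Under this assumption the activation is the same for any ordering of the same winners, so the feedforward weights alone encode a set rather than a sequence.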
The state of the neuron in any layer is binary, given by
where each neuron in a layer has its own threshold at each time instant. This threshold is adaptive and unique for each neuron. Only the maximally activated neuron (or winner) in a layer is assigned the state 1, and only if its threshold is exceeded. Our model implements the winner-take-all mechanism, which allows only the neuron of highest activity to learn. We say a neuron has fired if its state reaches 1.
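The state assignment described above can be sketched as follows. The function name is hypothetical, and the strict comparison with the threshold is an assumption.

```python
def winner_take_all(activations, thresholds):
    """Return the winner's index and the binary state vector of a layer.

    Only the maximally activated neuron can take state 1, and only if its
    activation exceeds its own threshold; all other states are 0.
    """
    winner = max(range(len(activations)), key=lambda i: activations[i])
    states = [0] * len(activations)
    if activations[winner] > thresholds[winner]:
        states[winner] = 1
    return winner, states


fired = winner_take_all([0.2, 0.9, 0.5], [0.5, 0.5, 0.5])
silent = winner_take_all([0.2, 0.9, 0.5], [0.5, 0.95, 0.5])
```

The winner is returned separately from the states because a winner may exist even when no neuron fires.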
Thus, a neuron integrates all inputs over its spatial and temporal RF until it reaches its threshold, at which point it fires if it is the winner. As soon as it fires, or if it fails to fire within the duration of its temporal RF, it discharges and then starts integrating again. The discharge from a neuron inhibits neighboring neurons in its own layer. As in prior work, it may be assumed that this lateral inhibition is proportional to a neuron’s total accumulated charge (or activation) and operates at a faster time scale. The inhibition is required to ensure that neurons in a layer do not get tuned to the same feature set. The inhibition influences a neuron’s activation, which in turn influences its inhibition. This cycle continues until a stable state is reached. In most practical cases, this inhibition is observed to be strong enough to drive all neurons close to their baseline activation. In our implementation, we assume this baseline to be zero, which does not affect our features qualitatively.
2.4 Updating weights and thresholds
Feedforward weights to a neuron whose state is 1 are updated following the Hebbian rule.
where the learning rate decreases with time for finer convergence. This weight update rule is obtained by applying gradient descent on the objective function in equ. 4 in an online setting. Feedforward weights leading to each neuron are initialized to ones and normalized to have unit norm, which allows all neurons in a layer to compete on an equal footing. A new neuron is not recruited unless the incoming pattern is more similar to the initialized feature than to any of the learned features. After each update, weights to each neuron are normalized to have unit norm. Thus, feedforward connections from a presynaptic neuron to a postsynaptic neuron that fire together are strengthened, while the remaining connections to that postsynaptic neuron are weakened. The weakening of connections is crucial for robustness as it helps remove infrequent coincident patterns, which are probably noise, from memory.
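A minimal sketch of one such update is given below. It assumes the common online spherical clustering step in which the winner's weight vector is moved toward the unit-norm input and then renormalized; this specific form, the function names and the toy input are assumptions made for illustration.

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def init_weights(n_neurons, dim):
    """All-ones weights scaled to unit norm, so every neuron starts on an equal footing."""
    return [unit([1.0] * dim) for _ in range(n_neurons)]


def hebbian_step(weights, x, winner, lr):
    """Move the winner's weights toward the input, then renormalize (assumed form)."""
    x_hat = unit(x)
    moved = [w + lr * xi for w, xi in zip(weights[winner], x_hat)]
    weights[winner] = unit(moved)
    return weights


weights = init_weights(n_neurons=2, dim=3)
before = list(weights[0])
hebbian_step(weights, x=[1.0, 0.0, 0.0], winner=0, lr=0.5)
after = weights[0]
```

In this sketch the weight on the active input grows, while renormalization shrinks the weights on the inactive inputs, which mirrors the strengthening and weakening described above. Neurons other than the winner are left unchanged.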
In the simple layer, the lateral weight from one neuron to another is also updated following the Hebbian rule as:
Thus, connections from a presynaptic neuron to a postsynaptic neuron that fire at consecutive time instants are strengthened, while the remaining connections from that presynaptic neuron are weakened. The weights are randomly initialized subject to a constraint, and the above learning rule ensures that the constraint continues to be satisfied. Since the states of the neurons are extremely sparse, the lateral weights can store a number of patterns from their correlations at consecutive time instants.
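The following sketch illustrates one way such lateral weights could behave as transition probabilities. It assumes that the outgoing weights of each neuron are non-negative and sum to one, and that the update is a convex combination that preserves this constraint. Both assumptions, along with the function names and the toy winner stream, are hypothetical.

```python
import random


def init_lateral(n, seed=0):
    """Random non-negative weights with zero self-connections and rows summing to 1.

    Assumes n >= 2 so that every row has at least one non-self connection.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        row = [0.0 if j == i else rng.random() for j in range(n)]
        total = sum(row)
        rows.append([v / total for v in row])
    return rows


def lateral_step(weights, prev_winner, winner, lr):
    """Strengthen prev_winner -> winner and weaken the other outgoing weights."""
    if prev_winner == winner:
        return weights  # no self-connections to update
    row = weights[prev_winner]
    weights[prev_winner] = [
        (1.0 - lr) * v + (lr if j == winner else 0.0) for j, v in enumerate(row)
    ]
    return weights


W = init_lateral(4)
stream = [0, 1, 2] * 50  # a toy stream of winners repeating the sequence 0, 1, 2
for prev, cur in zip(stream, stream[1:]):
    lateral_step(W, prev, cur, lr=0.1)
```

After training on the repeating stream, the weight from neuron 0 to neuron 1 dominates its row, while every row still sums to one and the row of the neuron that never won is untouched.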
The threshold is updated as follows:
where the threshold decay parameter is a constant. Due to the threshold, only a small subset of stimuli can trigger learning. The threshold decay ensures that the size of this subset remains fixed throughout the learning process, thereby maintaining the plasticity of the network. The winner-take-all mechanism, along with the threshold, favors neurons with sparsely distributed activity.
In the proposed model, a winner neuron always passes on its activations to its neighboring neurons in all layers irrespective of whether it fires or not. This is crucial for online operation, where learning and inference proceed simultaneously and not in distinct phases. If a pattern has been learned and a part of it is shown, a partial pattern of activations will stimulate the remaining neurons of the pattern to become active, thereby completing the whole pattern. However, the strength of connections will not be altered unless enough of the pattern has been seen (as determined by the threshold) and the RFs of the presynaptic neurons are the best match to the incoming pattern to fire the postsynaptic neuron in the higher layer.
3 Experimental results
The proposed model was deployed for learning visual features in a node from spatiotemporal data in an unsupervised and online manner. The feedforward weights were learned layer by layer with , . were initialized to a value slightly greater than such that the longest sequences may be captured. As stimuli, we used 17 videos recorded at different natural locations with a CCD camera mounted on the head of a cat exploring its environment. These videos provided a continuous stream of stimuli similar to what the cat’s visual system is naturally exposed to, preserving its temporal structure. The same catcam videos were used in [13, 15] for evaluating models on learning complex cell RF properties. As preprocessing, each frame was converted to grayscale and convolved with a Laplacian of Gaussian kernel, followed by rectification, to crudely highlight edges; this operation is believed to be performed by center-surround cells before the signal reaches V1. Spatiotemporal voxels spanning the entire duration of a video were extracted at fixed points on a grid, sampled every 25 pixels. These 99 voxels from each video formed our stimuli, leading to a total of about 5.3 million patches from the 17 videos.
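The edge-highlighting step can be sketched in plain Python as follows. The kernel size, the value of sigma, the zero-sum shift of the kernel and the use of full-wave rectification (absolute value) are all assumptions made for illustration. The toy images stand in for grayscale frames.

```python
import math


def log_kernel(size, sigma):
    """Laplacian-of-Gaussian kernel (up to a positive scale), shifted to sum to zero."""
    half = size // 2
    s2 = sigma * sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            row.append((r2 - 2.0 * s2) / (s2 * s2) * math.exp(-r2 / (2.0 * s2)))
        kernel.append(row)
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]


def filter_and_rectify(image, kernel):
    """'Valid' 2-D filtering followed by full-wave rectification.

    The kernel is symmetric, so correlation and convolution coincide.
    """
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            acc = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(kh)
                for b in range(kw)
            )
            row.append(abs(acc))
        out.append(row)
    return out


kernel = log_kernel(size=5, sigma=1.0)
flat = [[5.0] * 9 for _ in range(9)]                                  # uniform image
step = [[0.0 if c < 4 else 1.0 for c in range(9)] for _ in range(9)]  # vertical edge

flat_out = filter_and_rectify(flat, kernel)
step_out = filter_and_rectify(step, kernel)
```

A uniform region produces no response because the kernel sums to zero, whereas windows that straddle the step edge produce a clear response.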
3.1 Simple layer
Our model was simulated with 625 simple neurons in the simple layer, all with the same spatial RF size. Each simple neuron learned a unique visual feature from the stimuli. Qualitatively, the features belonged to three distinct classes of RFs – small unoriented features, localized and oriented Gabor-like filters, and elongated edge-detectors (see Fig. 2). Such features have been observed in macaque V1, and have been reported to be learned by computational models such as SAILnet and SSC.
If lateral connections encode transition probabilities and minimization of wiring length is an objective, neurons that fire in close temporal proximity will end up being spatial neighbors. Furthermore, if the stimulus changes gradually, neighboring neurons will develop similar feature preferences. In order to learn features in a topographic map, we organize the simple layer neurons on a 2D grid. At any time instant, the activation of the winner neuron from the previous instant is propagated to each of its neighbors, with an effect that decreases exponentially with the square of the distance between the two neurons on the grid. For a neighbor, the propagated activation is:
Here, the rate of decay is set by a constant. At any time instant, in addition to feedforward activation, each simple neuron receives an activation from a neighboring winner in the same layer. The simple layer activation is:
where the neuron index runs up to the number of neurons in the simple layer. The second term biases neighboring neurons to become the winner at the next instant. As a result, simple neurons that fire in close temporal proximity end up being spatially close in the 2D grid (hence, equations 12 and 5 are functionally equivalent). Consequently, the wiring length for pooling by complex neurons is reduced, in agreement with biological evidence [18, 19]. The topographic map is shown in Fig. 2. The pooling region in this topographic map as learned by each complex neuron is shown in Fig. 4.
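The propagation of the winner's activation over the grid can be sketched as below, assuming a decay of the form exp(-d²/σ), where d is the grid distance and σ is a positive constant. The exact form, the exclusion of the winner itself, the function names and the 5 × 5 toy grid are assumptions made for illustration.

```python
import math


def grid_position(index, width):
    """Map a neuron index to its (row, column) position on the 2D grid."""
    return divmod(index, width)


def propagated_activation(winner, winner_act, n_neurons, width, sigma):
    """Activation passed from the previous winner to every other neuron on the grid."""
    wr, wc = grid_position(winner, width)
    out = []
    for k in range(n_neurons):
        r, c = grid_position(k, width)
        d2 = (r - wr) ** 2 + (c - wc) ** 2
        out.append(0.0 if k == winner else winner_act * math.exp(-d2 / sigma))
    return out


# Toy 5 x 5 grid with the central neuron (index 12) as the previous winner.
lateral = propagated_activation(winner=12, winner_act=1.0, n_neurons=25, width=5, sigma=2.0)
```

Immediate neighbors of the winner receive the largest bias, and the bias falls off quickly toward the corners of the grid. Adding this term to the feedforward activation favors neighbors of the previous winner at the next instant.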
3.2 Complex layer
Our model was simulated with 25 complex neurons with a temporal RF size of 21 sampling instants. Being exposed to the catcam videos, each complex neuron got strongly connected to a subset of simple neurons in the simple layer, i.e., it learned a unique transformation to which it is now invariant. The spatial feature encoded by each simple neuron in this subset is an instance of the transformation. The activation of a complex neuron is high if the spatial stimulus matches any of these spatial features, and low otherwise. Thus, the response of complex neurons in our model is akin to that of complex cells in V1.
Due to the nature of the stimulus, our model was exposed to sequences of spatial stimuli in the catcam videos. Repeating sequences, if learned, would be useful for prediction. When trained with a sequence, a complex neuron in our model responds much more vigorously (as measured by its activation) to the corresponding set than to any other set. Here, each alphabet in a sequence or set refers to a unique spatial feature. Further, it responds more vigorously to the training sequence than to any other sequence over the same set, thereby manifesting the complex neuron’s direction selectivity. This is achieved by exploiting the set learned by the complex neuron in conjunction with the transition probabilities learned by the lateral connections in the simple layer. The difference in activations between the training sequence and any of its other permutations depends on how often other permutations of the set are presented. If no other permutation is presented, the difference in activations is high. In V1, 10-20% of cells show marked direction selectivity.
Prediction in our model amounts to computing the probability of a given simple neuron being the winner at the next time instant given which simple neuron was the winner at the current instant. This probability depends on the transition probabilities as well as the sets learned by the complex neurons. At any instant, the winner complex neuron restricts the set for the expected winner simple neuron. The highest expected one is then chosen from this set using the transition probabilities.
where the prior distribution is uniform. Fig. 3 shows the entropy of the system as it converges with learning. Fig. 4 shows the sets and sequences learned by eight neurons in our model. To reconstruct the sequence learned by a neuron, we select the most strongly connected feature from its set; its successor is the feature from the set that has the strongest lateral connection (the algorithmic implementation of equ. 13), and so on until a feature is repeated, signifying the end of the sequence.
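The reconstruction procedure can be sketched as follows. The data layout (a dictionary of feedforward weights for the set and a nested dictionary of lateral weights), the function names and the toy values are hypothetical.

```python
def predict_next(current, members, lateral):
    """Pick, within the set, the feature with the strongest lateral connection from `current`."""
    candidates = [j for j in members if j != current]
    if not candidates:
        return None
    return max(candidates, key=lambda j: lateral[current][j])


def reconstruct_sequence(set_weights, lateral):
    """Start at the most strongly connected feature and follow the strongest
    lateral connections within the set until a feature would be repeated."""
    members = list(set_weights)
    current = max(members, key=lambda i: set_weights[i])
    sequence = [current]
    while True:
        nxt = predict_next(current, members, lateral)
        if nxt is None or nxt in sequence:
            break
        sequence.append(nxt)
        current = nxt
    return sequence


# Toy set learned by one complex neuron: simple-neuron index -> feedforward weight.
set_weights = {3: 0.7, 5: 0.5, 8: 0.4}
# Toy lateral weights among the members: lateral[i][j] is the strength from i to j.
lateral = {
    3: {5: 0.8, 8: 0.2},
    5: {3: 0.1, 8: 0.9},
    8: {3: 0.6, 5: 0.4},
}
sequence = reconstruct_sequence(set_weights, lateral)
```

With these toy values the walk starts at feature 3, moves to 5 and then to 8, and stops because the strongest successor of 8 is feature 3, which has already been visited.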
4 Conclusions
Learning features invariant to arbitrary transformations in the data is a requirement for any recognition system, biological or artificial. Biological evidence and computational models have supported the role of simple-complex layers in V1 in achieving this goal. To understand their function as a canonical computational unit in a hierarchical or deep network, we presented a novel two-layered neural model that operates in a feedforward, unsupervised and online manner. When exposed to natural videos recorded with a camera mounted on a cat’s head, the first layer neurons learned three classes of spatial features that resemble the RFs in macaque V1, while the second layer neurons learned arbitrary transformations in the data. The activations of the second layer neurons were then invariant to these transformations, akin to the response of complex cells in V1. The learning rules for the two layers were derived from the same objective function, signifying their functional similarity. The simple and complex RFs were learned by spherical clustering in space and time, respectively, where the outliers were not allowed to influence the cluster centers.
The model could make higher-order predictions by simultaneously exploiting the transformations learned in the complex layer and the transition probabilities learned by the lateral connections in the simple layer. We showed the convergence of this predictive model while learning from the catcam videos. Unlike other models with predefined pooling regions or presumed group sparsity for learning topographic maps from spatial data, we used temporal continuity of data and physical constraints to learn a topographic feature map. The proposed model is fully-learnable with only two manually tunable parameters – the learning rate and the threshold decay parameter. We conclude that the model is an ideal candidate to be used as a canonical computational unit in a hierarchical network for real world applications and for understanding biological brain functions.
Research reported in this paper was partially supported by the U.S. National Science Foundation under CISE Grant No. 1231620.
[1] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160:106–154, 1962.
[2] K. Fukushima. Neocognitron for handwritten digit recognition. Neurocomputing, 51(1):161–180, 2003.
[3] Y. LeCun and Y. Bengio. Convolutional networks for images, speech and time series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 255–258. MIT Press, 1995.
[4] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3642–3649, 2012.
[5] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
[6] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Analysis and Machine Intelligence, 29:411–426, 2007.
[7] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1-2):143–175, 2001.
[8] A. Hyvarinen and P. O. Hoyer. A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18):2413–2423, 2001.
[9] K. Kavukcuoglu, M. A. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proc. Intl. Conf. Computer Vision and Pattern Recognition, 2009.
[10] S. Bagon and M. Galun. Large scale correlation clustering optimization. Computing Research Repository, arXiv:1112.2903, 2011.
[11] T. Serre, A. Oliva, and T. Poggio. A feedforward architecture accounts for rapid categorization. Proc. Natl. Academy of Sciences, 104(15):6424–6429, 2007.
[12] P. Földiák. Forming sparse representations by local anti-hebbian learning. Biological Cybernetics, 64:165–170, 1990.
[13] W. Einhäuser, C. Kayser, P. König, and K. P. Körding. Learning the invariance properties of complex cells from their responses to natural stimuli. European Journal of Neuroscience, 15(3):475–486, 2002.
[14] B. Y. Betsch, W. Einhäuser, K. P. Körding, and P. König. The world from a cat’s perspective – statistics of natural videos. Biological Cybernetics, 90(1):41–50, 2004.
[15] T. Masquelier, T. Serre, S. Thorpe, and T. Poggio. Learning complex cell invariance from natural videos: A plausibility proof. Technical Report 60, MIT, Cambridge, MA, December 2007.
[16] J. Zylberberg, J. T. Murphy, and M. R. DeWeese. A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of V1 simple cell receptive fields. PLoS Computational Biology, 7(10):e1002250, 2011.
[17] M. Rehn and F. T. Sommer. A network that uses few active neurones to code visual input predicts the diverse shapes of cortical receptive fields. Journal of Computational Neuroscience, 22(2):135–146, 2007.
[18] G. G. Blasdel. Orientation selectivity, preference, and continuity in monkey striate cortex. Journal of Neuroscience, 12(8):3139–3161, 1992.
[19] G. C. DeAngelis, G. M. Ghose, I. Ohzawa, and R. D. Freeman. Functional micro-organization of primary visual cortex: Receptive field analysis of nearby neurons. Journal of Neuroscience, 19(10):4046–4064, 1999.
[20] D. H. Hubel. Eye, Brain, and Vision. W. H. Freeman, 2nd edition, May 1995.