Abstract
In this paper we introduce a biologically inspired CNN architecture whose first convolutional layer mimics the role of the lateral geniculate nucleus (LGN). After training, the first layer of the net shows a rotationally symmetric pattern, justified by the structure of the net itself, that turns out to be an approximation of a Laplacian of Gaussian (LoG). The latter function is in turn a good approximation of the receptive profiles of the cells in the LGN. The analogy with the structure of the visual system is thus established, emerging directly from the architecture of the net.
1. Introduction
Convolutional Neural Networks (CNNs) have been inspired by the modular structure of the visual cortices, leading to an architecture made of many layers that try to mimic some behavior of the visual system. In particular, each convolutional layer obtains its input from the previous convolutional layer. The connections and similarities between CNNs and the visual system have been widely studied in recent years; however, CNNs have been developed independently, reaching state-of-the-art performance in many areas.
In [9] the authors introduce new smoothness functionals based on regularization theory for CNNs. The authors of [21] develop a CNN architecture based on the structure of the visual cortices, justifying each layer of the net by well-known behavior of the visual system, for example by introducing a first layer composed of Gabor filters with different orientations and scales. In [26] the close connections between each cortical layer and each convolutional layer are studied on the basis of the encoding and decoding ability of the visual system, using goal-driven hierarchical convolutional neural networks. Furthermore, in [27] the authors model the ventral stream (the series of cortical areas thought to subserve object recognition), comparing it with fMRI voxel responses. Moreover, in [17] the authors obtain Gabor-shaped filters by training an unsupervised learning algorithm on natural images.
Recurrent Neural Networks (see e.g. [22]) have been introduced to implement the horizontal connectivity typical of the layers of the natural visual cortex. A modification of these nets, more geometric and more similar to the structure of the brain, has recently been proposed in [16].
Despite these similarities, the structure of an artificial neural network and that of the human brain are still very different. In most cases the net is able to reconstruct, in the first convolutional layer, a bank of Gabor-shaped filters that approximate the Receptive Profiles (RPs) of the cells in V1. However, in [21], where the authors describe the connections between each CNN layer and the visual cortices, the role of the Lateral Geniculate Nucleus (LGN) is not investigated. In this paper we aim to introduce a CNN architecture that tries to mimic the behavior of the LGN. In particular, since the RPs of LGN cells are rotationally symmetric, we focus on the relation between the architecture of a CNN and the invariance properties of its filters, in order to find an architecture that gives rotationally symmetric filters in the first convolutional layer.
As regards CNNs, their invariance properties are one of the main theoretical problems concerning them. Convolutional layers in classical CNNs are translation-invariant, since every filter is applied all over the input. In the case of image classification, this allows the net to recognize the image regardless of the position of the object.
Many invariances and symmetries are present at the level of the filters obtained through the backpropagation algorithm. One of the most powerful and most studied is rotation invariance. Indeed, it often happens that at a certain layer the same filter appears several times, rotated by different angles. In some cases this behavior has suggested modifying the structure of a CNN by imposing rotation invariance directly on the filters. In [15] and [25] the authors rotate the filters in each layer by different angles and apply them to the input. This reduces the number of filters at each layer and, as a consequence, the number of parameters and the complexity of the net, while giving results comparable to state-of-the-art CNNs.
Another method for introducing invariance under rotations is data augmentation, which consists in increasing the number of training images by flipping, rotating, rescaling, cropping, or adding Gaussian noise to them. In this way the net is trained on a larger set of images and can learn different kinds of invariances. In [6] and [5] the authors rotate the same image by different angles in order to achieve rotation invariance.
In [4], [12] and [7] the authors introduce one or more novel layers to address the rotation invariance problem. In [4] they propose four different layers that work on the input rotated by 0, 90, 180 and 270 degrees. In [12] a different kind of pooling is used before the fully-connected layers in order to obtain invariance with respect to different features. A kernel that detects symmetries in the input and connects consecutive layers is introduced in [7].
In the present work, we propose a CNN architecture inspired by the structure of the visual cortices, called LGN-CNN. In particular, the idea is to introduce a first layer that behaves similarly to the LGN, which prefilters the visual stimulus coming from the retina and highlights the contours of the objects present in the stimulus. The RPs of cells in the LGN can be approximated by a Laplacian of Gaussian (LoG), which is rotationally symmetric (for a review see for example [18]). Thus, our first layer contains a single filter: this favors the development of rotation symmetry during the training phase, so that the filter eventually approximates a LoG.
As regards the filters of our CNN obtained after training, we expect the second layer of the net to mimic the behavior of V1, which is the first visual cortex that analyzes the visual stimulus after the LGN. Simple cells in V1 have been studied for a long time and their RPs can be approximated by a Gabor function (see e.g. [2], [11], [13], [18]). Thus, we have studied the shapes of the second-layer filters by approximating the filters obtained after the training phase with a Gabor function. We have compared the results obtained on an LGN-CNN and on a classical CNN with the RPs of a macaque's V1 from the work of Ringach ([19]). We have observed that the second-layer filters obtained with our CNN are more similar to the neuroscience data on the macaque's RPs than those of a classical CNN. This strengthens the link between our architecture and the structure of the visual system, at least as regards the LGN cells and the simple cells in V1.
2. The visual system
The visual system is one of the most studied and best understood areas of the brain. We will describe only the parts that are most relevant to our study. For a complete overview see for example [18], [23], [10].
The retina is a light-sensitive layer of tissue of the eye which receives the visual stimulus and translates it into electrical impulses. These impulses first reach the LGN, a part of the thalamus whose cells preprocess the visual stimulus. The impulse is then processed by the cells of V1, whose output is passed as input to all the other layers of the visual cortex.
We are mainly interested in the cells of the LGN and in the simple cells of V1. Each cell receives the electrical impulse from a small portion of the retina called the receptive field (RF). The RF of each cell is divided into excitatory and inhibitory areas, which are activated by light and can be modeled by a function called the receptive profile (RP). Thus, the cell fires if the excitatory areas are activated, whereas it is silenced if the inhibitory areas are activated. Figure 1 shows the RP of an LGN cell, which can be modeled by a Laplacian of Gaussian (LoG). Indeed, it highlights the contours of the objects in the image, whereas it sets all the uniform areas to zero. On the other hand, figure 2 shows the RPs of two simple cells of V1 modeled by Gabor functions. These cells fire if the contour of the object passing through their RF has an orientation similar to that of their RPs.
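To make the role of the LoG concrete, the following Python sketch samples minus the LoG on a small grid and checks the two properties mentioned above: rotational symmetry and a zero response on uniform areas. The grid size, the value of sigma, the mean subtraction and all helper names are illustrative assumptions, not values taken from the experiments.

```python
import math

def neg_log_kernel(size, sigma):
    """Sample minus the LoG (positive center, negative surround) on a size x size grid."""
    half = size // 2
    kernel = []
    for i in range(-half, half + 1):
        row = []
        for j in range(-half, half + 1):
            r2 = i * i + j * j
            # Minus the LoG, up to a positive multiplicative constant
            row.append((1.0 - r2 / (2.0 * sigma ** 2)) * math.exp(-r2 / (2.0 * sigma ** 2)))
        kernel.append(row)
    # Assumption: subtract the mean so that the sampled kernel sums to zero
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def rot90(m):
    """Rotate a square matrix by 90 degrees counterclockwise."""
    n = len(m)
    return [[m[j][n - 1 - i] for j in range(n)] for i in range(n)]

def response(kernel, patch):
    """Response of a cell with the given profile to a patch of the same size."""
    return sum(k * p for krow, prow in zip(kernel, patch) for k, p in zip(krow, prow))

kernel = neg_log_kernel(7, 1.0)
uniform_patch = [[0.5] * 7 for _ in range(7)]
edge_patch = [[0.0] * 4 + [1.0] * 3 for _ in range(7)]  # vertical contour

print(rot90(kernel) == kernel)              # the profile is unchanged by a 90 degree rotation
print(response(kernel, uniform_patch))      # essentially zero on a uniform area
print(response(kernel, edge_patch))         # clearly nonzero next to a contour
```

The sampled kernel is only symmetric under the rotations of the pixel grid (multiples of 90 degrees); the continuous LoG is symmetric under every rotation.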
3. Introducing the LGN-CNN architecture
In this section we introduce the main novelty of this paper, a CNN architecture inspired by the structure of the visual system. In particular, we investigate the role of the LGN, whose cells have receptive profiles that can be approximated by a LoG. Since the LoG is a rotationally symmetric function, we aim to find a CNN architecture that achieves this property in the first layer and eventually attains a LoG shape.
3.1. Relations between a CNN and the visual system
Many authors have studied the connections between a CNN and the RPs of simple cells in V1 (see e.g. [26], [27], [17], [9], [21]). In particular, the first convolutional layer of a CNN is usually composed of a bank of Gabor filters obtained after the training phase. This suggests that the first convolutional layer analyzes the contours of the objects present in the image, similarly to what happens in V1. The architecture of a CNN also has many points in common with the structure of the visual system. Just to recall some of them:

It is composed of many convolutional layers that analyze the image sequentially, similarly to the visual cortices of the visual system;

After each convolutional layer it applies a nonlinearity that sets all negative values to zero, similarly to neurons, which fire only if they reach a certain voltage;

The spatial dimensions decrease after each convolutional layer, allowing the filters of the next convolutional layer to analyze information from a larger area of the original image, similarly to neurons of subsequent visual cortices, which receive information from a group of neurons of the previous visual cortex.
There are several differences between CNNs and the visual system, for example the fact that the RPs of cells are contextual and vary according to the visual stimulus. The structure of the visual cortices is also more complex, since the neurons in the same cortex communicate with each other and later layers can communicate with earlier ones, giving feedback information. However, studying the similarities between CNNs and the visual system could improve our understanding of both. In the next subsections we describe the problem we have faced and how we have addressed it.
3.2. Introducing LGN in a CNN
As far as we know, the action of the LGN has so far been ignored and has not been implemented in a CNN. As we have already discussed in Section 2, the RP of an LGN cell can be modeled by a LoG that acts directly on the visual stimulus and highlights the contours. Our aim is to build a CNN architecture that mimics this behavior in order to strengthen the links between CNNs and the visual system.
Since the LGN preprocesses the visual stimulus before it reaches V1, we should add a first layer at the beginning of the CNN that reproduces the role of the LGN. It should apply a LoG to the image, highlighting the contours of the objects, without modifying the dimensions of the input in any way. Thus, the idea is to introduce a first convolutional layer containing a single filter that should eventually attain a LoG shape.
The theoretical idea behind this structure can be found in a simple result on rotationally symmetric functionals. In particular, if a rotationally symmetric functional $F$ has a unique minimum $\omega$, then $\omega$ is also rotationally symmetric. Indeed, since $F$ is rotationally symmetric, $F(\omega \circ g) = F(\omega)$ for every rotation $g$, so that $\omega \circ g$ is also a minimum. Thus, since the minimum is unique, $\omega \circ g = \omega$, and this implies the rotation symmetry of the solution. There are several results on the symmetries of minima of functionals, for example in [14], [8]. Our aim is to extend these results to the case of CNNs, in particular to our architecture, which we call the Lateral Geniculate Nucleus Convolutional Neural Network (LGN-CNN).
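The argument can be illustrated on a toy problem. The sketch below is not the LGN-CNN loss: it assumes a strictly convex quadratic functional on 3x3 filters, built from a hypothetical set of patches that is closed under rotations by multiples of 90 degrees, so that the functional is invariant under those rotations and has a unique minimum. Gradient descent started from an asymmetric filter then converges to a filter that is unchanged by a 90 degree rotation. The patches, targets, step size and regularization weight are all assumptions chosen for illustration.

```python
def rot90(m):
    """Rotate a square matrix by 90 degrees counterclockwise."""
    n = len(m)
    return [[m[j][n - 1 - i] for j in range(n)] for i in range(n)]

def orbit(m):
    """The four rotations of m by multiples of 90 degrees."""
    out = [m]
    for _ in range(3):
        out.append(rot90(out[-1]))
    return out

def inner(a, b):
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def gradient(w, samples, lam):
    """Gradient of F(w) = sum_k (<w, x_k> - t_k)^2 + lam * |w|^2."""
    n = len(w)
    g = [[2.0 * lam * w[i][j] for j in range(n)] for i in range(n)]
    for x, t in samples:
        err = inner(w, x) - t
        for i in range(n):
            for j in range(n):
                g[i][j] += 2.0 * err * x[i][j]
    return g

# Hypothetical patches; each one enters with all its rotations and the same
# target, which makes F invariant under 90 degree rotations of w.
base = [([[1, 0, 0], [0, 1, 0], [0, 0, 0]], 1.0),
        ([[0, 1, 1], [0, 0, 1], [0, 0, 0]], -1.0)]
samples = [(r, t) for x, t in base for r in orbit(x)]

# Deliberately asymmetric starting point
w = [[0.1 * (3 * i + j) for j in range(3)] for i in range(3)]
for _ in range(3000):
    g = gradient(w, samples, lam=1.0)
    w = [[w[i][j] - 0.01 * g[i][j] for j in range(3)] for i in range(3)]

rw = rot90(w)
asymmetry = max(abs(w[i][j] - rw[i][j]) for i in range(3) for j in range(3))
print(asymmetry)  # numerically zero: the unique minimizer inherits the symmetry
```

The regularization term is what guarantees uniqueness here; without it the minimum could fail to be unique and the argument would no longer apply.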
We have also studied the modifications that occur to the filters of the second convolutional layer in our architecture. In particular, we have compared their shapes with the RPs of a macaque's V1 from the work of Ringach ([19]).
In his work, Ringach recorded the RPs of a macaque's V1 and fitted them with a Gabor function defined as follows:
(1) $$h(x', y') = A \exp\left(-\left(\frac{x'}{\sqrt{2}\,\sigma_x}\right)^2 - \left(\frac{y'}{\sqrt{2}\,\sigma_y}\right)^2\right) \cos(2\pi f x' + \phi)$$
where $(x', y')$ is translated and rotated from the original coordinate system $(x, y)$, $\sigma_x$ and $\sigma_y$ are the widths of the Gaussian envelope, $f$ is the spatial frequency and $\phi$ is the phase of the cosine.
He then compared his results with the filters obtained using Independent Component Analysis and Sparse Coding. These are the main steps he followed:

Recording the RPs from several simple cells in V1;

Fitting the Gabor function defined in equation (1) to the RPs;

Comparing the results on the $(n_x, n_y) = (\sigma_x f, \sigma_y f)$ plane.
Thus, we have followed the same steps, approximating the filters obtained after the training phase with the Gabor function (1) and plotting them on the $(n_x, n_y)$ plane. In this way we can compare the elongation in the $x$ and $y$ directions, which characterizes the RPs of simple cells in V1. This analysis should strengthen the link between our architecture and the structure of the visual system, at least as regards simple cells in V1.
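As a minimal sketch of the quantities involved, the following Python code evaluates a Gabor function in the parameterization used by Ringach (a Gaussian envelope with widths sigma_x and sigma_y, multiplied by a cosine of frequency f and phase phi along the rotated x axis) and computes the pair (sigma_x * f, sigma_y * f) that is plotted for each fitted filter. The function and parameter names are our own, and the fitting step itself (for example by nonlinear least squares) is not shown.

```python
import math

def gabor(x, y, A, x0, y0, theta, sigma_x, sigma_y, f, phi):
    """Gabor function evaluated at (x, y) in the original coordinate system."""
    # Translate by (x0, y0) and rotate by theta to get the primed coordinates
    xp = (x - x0) * math.cos(theta) + (y - y0) * math.sin(theta)
    yp = -(x - x0) * math.sin(theta) + (y - y0) * math.cos(theta)
    envelope = math.exp(-(xp / (math.sqrt(2) * sigma_x)) ** 2
                        - (yp / (math.sqrt(2) * sigma_y)) ** 2)
    return A * envelope * math.cos(2 * math.pi * f * xp + phi)

def shape_indices(sigma_x, sigma_y, f):
    """Envelope widths measured in cycles of the cosine: (n_x, n_y)."""
    return sigma_x * f, sigma_y * f

# Hypothetical fitted parameters of one filter
params = dict(A=1.0, x0=0.0, y0=0.0, theta=0.0, sigma_x=2.0, sigma_y=3.0, f=0.1, phi=0.0)
n_x, n_y = shape_indices(params["sigma_x"], params["sigma_y"], params["f"])
print(n_x, n_y)

# With f = 0 and phi = 0 the cosine is constant and only the Gaussian remains
gaussian_value = gabor(1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0)
print(gaussian_value, math.exp(-0.5))
```

A filter with n_y larger than n_x is elongated along its stripes, while a point near the origin of the plane corresponds to a blob-like, nearly Gaussian profile.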
3.3. The layers that compose LGN-CNN
Let us formally describe the architecture of LGN-CNN. This net takes as input an image $I$, modeled as a function $I : D \to \mathbb{R}$ where $D$ is a square of size $n \times n$, and tries to classify it among $K$ different categories. Throughout the description we assume that the input of each layer is a tensor $x$ of size $n \times n \times m$, where $n \times n$ is the spatial dimension and $m$ matches the number of filters of the last convolutional layer applied.
The $l$-th convolutional layer $\ell^l$ is a functional that acts on its input by applying $m_l$ linear operators, defined by filters
$$\psi^l_k : Q \to \mathbb{R}, \qquad k = 1, \dots, m_l,$$
where $Q$ is a cube of size $s_l \times s_l \times m$. Indeed each filter $\psi^l_k$ is a tensor of size $s_l \times s_l \times m$, obtained after training. Thus the convolutional layer acts in the following way on its input $x$:
$$\ell^l(x)(i, j, k) = \sum_{(a, b, c) \in Q_{ij}} x(a, b, c)\, \psi^l_k(a - i, b - j, c),$$
where $Q_{ij}$ is the cubic neighborhood of $(i, j)$ of the same size as the filter. Let us note that the first convolutional layer $\ell^0$ is composed of only one filter $\psi^0$ of size $s_0 \times s_0$.
After each convolutional layer we apply a ReLU $R$, a function that acts elementwise on its input and sets all negative numbers to zero:
$$R(x) = \max(0, x).$$
After each convolutional layer and its ReLU, with the exception of the first layer $\ell^0$, we apply a pooling that splits the output of the ReLU into squares of size $q \times q$ in the spatial dimensions and shrinks each of them to a single value, usually the maximum or the average of the elements in the square.
Formally, if $P_r$ denotes the $r$-th square of the decomposition of the input $x$, then the maximum pooling and the average pooling are defined as
$$p_{\max}(x)(r) = \max_{(i, j) \in P_r} x(i, j), \qquad p_{\mathrm{avg}}(x)(r) = \frac{1}{q^2} \sum_{(i, j) \in P_r} x(i, j).$$
Note that after the first layer $\ell^0$ we do not apply any pooling. In this way, adding $\ell^0$ to a classical CNN does not modify the structure of the rest of the net; this is a great advantage, since we can easily compare the two architectures. Moreover, we add very little complexity to the problem, since the number of parameters only increases by the size of the single filter $\psi^0$. Furthermore, $\ell^0$ prefilters the input image without modifying its dimensions in any way; this mimics the behavior of the LGN and brings the net closer to the structure of the visual system.
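The building blocks described in this subsection can be sketched in a few lines of plain Python for a single-channel input. The sketch assumes zero padding so that a single-filter layer preserves the spatial size of its input, which is one way (not necessarily the one used in our experiments) to obtain a first layer that leaves the dimensions unchanged; the kernel below is a hypothetical stand-in for a learned filter.

```python
def conv_same(image, kernel):
    """Apply one square filter to a 2D image with zero padding (output size = input size)."""
    n, m = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    ii, jj = i + a - half, j + b - half
                    if 0 <= ii < n and 0 <= jj < m:  # outside the image counts as zero
                        acc += image[ii][jj] * kernel[a][b]
            out[i][j] = acc
    return out

def relu(x):
    """Set all negative entries to zero, elementwise."""
    return [[max(0.0, v) for v in row] for row in x]

def pool(x, q, reduce):
    """Split x into non-overlapping q x q squares and reduce each one to a single value."""
    return [[reduce([x[i + a][j + b] for a in range(q) for b in range(q)])
             for j in range(0, len(x[0]) - q + 1, q)]
            for i in range(0, len(x) - q + 1, q)]

def average(values):
    return sum(values) / len(values)

image = [[0, 0, 1, 1] for _ in range(4)]             # vertical contour
kernel = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]       # hypothetical center-surround filter
features = relu(conv_same(image, kernel))            # same 4 x 4 size as the image
pooled_max = pool(features, 2, max)                  # 2 x 2 after max pooling
pooled_avg = pool(features, 2, average)              # 2 x 2 after average pooling
```

Passing the reduction as an argument keeps the maximum and the average pooling as two instances of the same splitting step.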
At the end of the net, after all the convolutional layers, ReLUs and poolings, there is a fully-connected (FC) layer which modifies the size of its input, eventually matching the number of categories $K$. In particular, the FC layer is a functional that acts in the same way as a convolutional layer; indeed it is composed of $m_{FC}$ linear operators. The main difference is that each filter, represented by a tensor, has the same dimensions as the input. Thus the output of the FC layer is a vector of length $m_{FC}$; in this way, if we set $m_{FC} = K$ we obtain a vector as long as the number of categories $K$.
Let us note, however, that there are no constraints on the values that the FC layer's output $z$ can take. Since we would like to classify an image, the output of the net should represent a probability distribution. Thus we apply a softmax $S$ that modifies the values in the following way:
(2) $$S(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K.$$
We then obtain a probability distribution over the set of $K$ categories, since $S(z)_i \in [0, 1]$ and $\sum_{i=1}^{K} S(z)_i = 1$. Thus we have built a function $F$ that takes an input image $I$ and returns a probability distribution $F(I)$ over a set of $K$ categories.
The net selects the category that is most likely associated with $I$, namely
$$\arg\max_{i = 1, \dots, K} F(I)_i.$$
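The last two steps, turning the unconstrained output of the final fully-connected layer into a probability distribution and selecting the most likely category, can be sketched as follows. The scores are hypothetical, and subtracting the maximum before exponentiating is a standard numerical safeguard that does not change the result.

```python
import math

def softmax(z):
    """Map a vector of real scores to a probability distribution."""
    m = max(z)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(z):
    """Index of the category with the highest probability."""
    p = softmax(z)
    return max(range(len(p)), key=p.__getitem__)

scores = [0.1, 2.0, -1.0]      # hypothetical output of the last layer, one value per category
probabilities = softmax(scores)
category = predict(scores)
```

Since the softmax is increasing in each score, the selected category is also the one with the largest raw score.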
A good classifier should have good performance on a test set. But how can we determine all the parameters in the net?
3.4. The loss function
So far we have described the layers that compose the structure of the net, but we have not said how all the parameters are obtained. First of all, let us point out that all the parameters of the net come from the convolutional layers and the FC layer, since the pooling and ReLU layers have no parameters to fit. Then, given a label functional that is defined on a set of images and correctly classifies every image in the training set, we can define a loss function in order to find a functional $F$ that classifies well the images in the training set and, eventually, other images as well (for example, the images in a test set). The loss function we have used is the following:
(3)
If the loss function is small over a training set, the net is able to perform the classification task on that set. Our objective is therefore to minimize the loss function on a training set and, by applying a backpropagation algorithm, determine the parameters of the net. This procedure is repeated several times in order to obtain a better approximation of the parameters and achieve better performance on a test set.
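As an illustration only, the sketch below evaluates the average cross-entropy between the probability distributions returned by a net and the true labels. This is a common choice for softmax classifiers and is an assumption here, not necessarily the loss of equation (3); the probabilities and labels are hypothetical.

```python
import math

def cross_entropy(probabilities, labels):
    """Average of -log(p[label]) over a set of images (assumed loss, for illustration)."""
    total = 0.0
    for p, label in zip(probabilities, labels):
        total -= math.log(p[label])
    return total / len(labels)

# One image whose true category is 0, classified well and badly
confident = cross_entropy([[0.9, 0.1]], [0])
poor = cross_entropy([[0.1, 0.9]], [0])
print(confident, poor)  # the loss is smaller when the correct category gets more probability
```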
4. Applications of LGN-CNN
4.1. Settings
In this subsection we describe the settings in which we have tested our architecture. We have used MATLAB R2019a (academic license) and the MatConvNet toolbox (see [24]).
Since we would like to obtain a rotationally symmetric filter, we have decided to train our LGN-CNN on a dataset of natural images called STL-10 (see [1]), which contains 5000 training images divided into 10 different classes. Indeed, a dataset in which a few directions predominate (for example, mostly vertical lines) could have led to a directional filter in the first layer of our architecture. We have modified the images in the following way: we have converted them from RGB to grayscale using the built-in MATLAB function rgb2gray, and we have rotated them randomly by 0, 90, 180 or 270 degrees. The rotation is meant to prevent any possible shift of the center of the filter in the first layer: most images contain the sky, which is at the top of the image and could shift the center of the filter towards the bottom. Furthermore, we have chosen this dataset because of the size of its images, $96 \times 96$; in fact we decided to use quite large filters in the first and second layers in order to obtain more information about their shapes.
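The two preprocessing steps can be sketched in plain Python as follows. This is not the MATLAB code used for the experiments: the grayscale conversion uses the rounded luminance weights documented for rgb2gray, images are represented as nested lists, and the helper names and the tiny example image are hypothetical.

```python
import random

def rgb_to_gray(pixel):
    """Weighted sum of the R, G and B channels (rounded rgb2gray weights)."""
    r, g, b = pixel
    return 0.2989 * r + 0.5870 * g + 0.1140 * b

def to_grayscale(image):
    """Convert an image given as rows of (r, g, b) tuples to rows of gray values."""
    return [[rgb_to_gray(p) for p in row] for row in image]

def rot90(m):
    """Rotate a square image by 90 degrees counterclockwise."""
    n = len(m)
    return [[m[j][n - 1 - i] for j in range(n)] for i in range(n)]

def random_right_angle_rotation(image, rng):
    """Rotate by 0, 90, 180 or 270 degrees, chosen uniformly at random."""
    k = rng.randrange(4)
    for _ in range(k):
        image = rot90(image)
    return image, k

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
rgb_image = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
             [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]]
gray = to_grayscale(rgb_image)
rotated, quarter_turns = random_right_angle_rotation(gray, rng)
```

Rotations by multiples of 90 degrees map the pixel grid onto itself, so no interpolation is needed and the image size is preserved.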
Figure 3 shows the architecture of our CNN. The input is a grayscale image. Then there is the first convolutional layer, composed of a single filter and followed by a ReLU. The second layer, composed of 64 filters, thus receives as input a matrix of the same size as the image. Note that the stride is 2, which is why the spatial dimensions are halved. The net then applies a ReLU and a max pooling. The third and last convolutional layer is composed of 32 filters and is followed by a ReLU and an average pooling. Finally, two fully-connected layers are applied, giving as output a vector of length 10. Here the net applies the softmax defined in equation (2) in order to obtain a probability distribution over the 10 classes. The functional that models this net is the following:
(4)
4.2. The first layer of the net
After the training phase we can analyze the net we have obtained. In this subsection we focus on the first layer. Figure 4 compares the filter of the first layer with minus the LoG. Figure 4(a) is the actual result of the net after training, which has positive values in the center and negative ones around it. Figure 4(b), which shows minus the Laplacian of Gaussian, has a shape very close to that of the filter. Moreover, figure 5(a) shows the actual filter obtained after the training phase, while in figures 5(b) and 5(c) we compare an approximation of the filter and minus the LoG, in which the rotationally symmetric pattern is clearer. Let us note that the filter attains this shape without any constraint except the structure of the net itself. This suggests that the choice of the net architecture directly influences the filters we obtain, allowing us to link the net architecture to the structure of the cortical layers.
4.3. The second layer of the net
Since we would like to strengthen the link between our architecture and the structure of the visual system, we have decided to study the filters of the second layer, comparing them with real data obtained on macaques in [19]. Therefore, we have trained two different CNNs: an LGN-CNN, defined by the functional (4), and a classical CNN, defined by the functional (5), in which we have eliminated the first convolutional layer and the ReLU that follows it, which are characteristic of our architecture.
(5) 
Let us note that in both architectures the layer with 64 filters (the first convolutional layer of the classical CNN and the second of LGN-CNN) contains filters with Gabor shapes after training. This is a well-known result on the filters of the first convolutional layer of CNNs (e.g. in [21], [26]); the introduction of a first layer composed of a single filter does not change this behavior, which strengthens the link between our architecture and the structure of the visual system. We have then studied the statistical distribution of these banks of filters, comparing the results with the real data of Ringach.
We have approximated the filters in the banks using the function (1); figure 6 shows some of the filters of LGN-CNN and their approximations. We can define $(n_x, n_y) = (\sigma_x f, \sigma_y f)$, where $\sigma_x$ and $\sigma_y$ are the widths of the Gaussian envelope and $f$ is the frequency of the cosine in (1). Here $n_x$ and $n_y$ estimate the elongation in the $x$ and $y$ directions respectively, thanks to $\sigma_x$ and $\sigma_y$. They are rescaled by $f$, which indicates how far the shape of the filter is from a Gaussian; in particular, if $f = 0$ the function in (1) reduces to a Gaussian, since the cosine becomes a constant. In order to better compare the plots, we looked for the distribution that best fits the neural data. In particular, it approximates the points closer to the origin with a line, and then approximates the rest of the points with a second line starting from the end of the first one.
Figure 7 shows the three plots we would like to compare. Let us note that introducing the first layer of LGN-CNN modifies the elongation of the Gabor filters in the following layer. In particular, in the classical CNN the filters are often more elongated in the $x$ direction, as we can see from the slope of the interpolating line in figure 7(a). In figure 7(b) we can see that the slope changes greatly and that the filters become much more elongated in the $y$ direction. The same behavior is observed in the case of the RPs (figure 7(c)), whose distribution has a slope similar to that of LGN-CNN. This further strengthens the link between LGN-CNN and the structure of the visual system, motivating us to pursue this direction.
5. Rotation symmetry of the first-layer filter: a possible setting
In this section we define a setting in which it should be possible to study the rotation symmetry of the filter of the first layer. The idea is to extend some results on rotation invariant functionals (see [14]).
Let us consider the architecture of a LGNCNN in which we can split the first convolutional layer composed by only one filter from the rest of the net which will be fixed. Thus this first layer can be approximated by a function , assuming . A general image can be defined as a function where we assume . Thus the rest of the net will be defined by a functional
And then we can define
(6) 
Then is a probability distribution over a set of categories. Thus our aim is to find a function in such a way that approximates well the known functional which is defined on a subset of all the images.
Since we can impose that the output of has the following properties:
The idea is to adapt to our case the proof given in [14] for a certain class of functionals.
6. Conclusions
The study of the role of the LGN in the visual system and of the rotation invariance properties of the RPs of its cells has led our research to the introduction of a CNN architecture that mimics this structure. In particular, we have added to a CNN a first convolutional layer composed of a single filter, which attains a rotationally symmetric pattern. The filter inherits this property from the modified architecture of the net.
We have also shown that the filter is not only rotationally symmetric but also attains a LoG shape that approximates the RPs of the LGN cells. This behavior strengthens the link between the structure of the visual system and the architecture of CNNs. Furthermore, we have analyzed the statistical distribution of the filters of the second convolutional layer, which attain a Gabor shape even with the introduction of the first layer. We have shown that this distribution becomes closer to the real data on the RPs of simple cells in V1 (from [19]), enriching the connections with the neural structure.
In the future we will address the theoretical problem of the rotation symmetry of the first convolutional layer. Furthermore, we will analyze the modifications that occur in an LGN-CNN to the banks of filters of the other convolutional layers of deeper architectures, comparing them with neural data.
References
B. Fasel and D. Gatica-Perez. Rotation-invariant neoperceptron. In Proc. International Conference on Pattern Recognition (ICPR), vol. 3, pp. 336-339. IEEE, 2006.
A. Sherstinsky. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. CoRR, 2018.
D. Yamins and J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, vol. 19, p. 356, 2016.