Revealing Fine Structures of the Retinal Receptive Field by Deep Learning Networks

11/06/2018 ∙ by Qi Yan, et al. ∙ University of Leicester ∙ Microsoft ∙ Tsinghua University ∙ Peking University

Deep convolutional neural networks (CNNs) have demonstrated impressive performance on many visual tasks. Recently, they have become useful models of the visual system in neuroscience. However, it is still not clear what CNNs learn in terms of neuronal circuits. When a deep CNN with many layers is used to model the visual system, it is not easy to compare the structural components of the CNN with possible neuroscience underpinnings, owing to the highly complex circuits from the retina to higher visual cortex. Here we address this issue by focusing on single retinal ganglion cells, using biophysical models and recording data from animals. By training CNNs with white noise images to predict neuronal responses, we found that fine structures of the retinal receptive field can be revealed. Specifically, the learned convolutional filters resemble biological components of the retinal circuit. This suggests that a CNN learning from a single retinal cell reveals a minimal neural network carried out in this cell. Furthermore, when CNNs learned from different cells are transferred between cells, transfer learning performance is diverse, which indicates that CNNs are cell-specific. Moreover, when CNNs are transferred between different types of input images, here white noise vs. natural images, transfer learning performs well, which implies that the CNN indeed captures the full computational ability of a single retinal cell for different inputs. Taken together, these results suggest that CNNs could be used to reveal structural components of neuronal circuits and provide a powerful model for neural system identification.




I Introduction

Deep convolutional neural networks (CNNs) have been a powerful model for numerous tasks related to system identification in recent years [1]. Trained on a large set of target images, a CNN can achieve human-level performance in visual object recognition. However, understanding the relationship between computation and the underlying network structure components learned within CNNs remains a challenge [2, 3]. Thus, visualizing, interpreting, and understanding CNNs is not trivial [4].

Inspired by neuroscience studies [5], a typical CNN consists of a hierarchical structure of layers [6], where one of the most important properties of each convolutional (conv) layer is that a conv filter can be used as a feature detector to extract useful information from input images [7, 8]. Therefore, after learning, conv filters are meaningful. The features captured by these filters can be represented in the original natural images [4]. Often, a typical feature shares some similarities with parts of natural images from the training set. These similarities are obtained by using a very large set of specific images with reasonable labels. The benefit is that features are relatively universal for one category of objects, which is good for recognition. However, it also makes visualization and interpretation difficult because of the complex statistical structure of natural images [9]. As a result, the filters learned in CNNs are often not easy to interpret [10].

On the other hand, researchers have begun to adapt CNNs for studying central questions of neuroscience [11, 12]. For example, CNNs have been used to model the ventral visual pathway, which has been suggested as a route for visual object recognition starting from the retina, passing through visual cortex, and reaching inferior temporal (IT) cortex [13, 14, 15, 12]. The prediction of neuronal responses in this case performs surprisingly well. However, the final output of this CNN model represents dense computations conducted in many layers, which may or may not be relevant to the biological underpinnings of information processing in the brain. Understanding these network components of the CNN is difficult given that IT cortex sits at a higher level of our visual system [12].

Fig. 1: Illustration of biophysical model and CNN model to study the retinal computing. (A) Retinal circuit computes its output as a sequence of spikes for each RGC when visual scenes are received by the eyes. (B) Illustration of RGC model structure used in the current paper. Simplified neuronal circuit of a single RGC can be represented by a biophysical model that consists of a bank of subunit linear filters and nonlinearities. Note there are four subunits playing the role of conv filters. (C) Illustration of CNN model structure. CNN is used to train the same set of stimulus images to predict the spikes of all images for both biological RGC data and biophysical model data.

In principle, CNN models can also be applied to early sensory systems where the organization of the underlying neuronal circuitry is relatively clear and simple. Thus one expects that knowledge of these neuronal circuits could provide useful and important validation for CNNs. Indeed, a few studies have applied CNNs and their variations to the earlier visual system, such as the retina [16, 17, 18, 19], V1 [20, 21, 22, 23, 24, 25, 26] and V2 [27]. Most of these studies are driven by the goal of achieving better prediction of neural responses by using feedforward or recurrent neural networks (or both). These new approaches increase the complexity of system identification compared with conventional linear/nonlinear models [28, 29, 30]. Some of these studies also look into the details of network components after learning to see if and how they are comparable to the biological structure of neuronal networks [19, 24, 26].

The retina, compared to other earlier visual systems, has a relatively simple neuronal circuit with three layers of neurons (photoreceptors, bipolar cells and ganglion cells), together with inhibitory horizontal and amacrine cells in between, as illustrated in Fig. 1(A). The retinal ganglion cells (RGCs), as the only output neurons of the retina, send visual information via the optic tracts and the thalamus to cortical areas for higher cognition. Each RGC receives input from a number of excitatory bipolar cells (BCs) as the driving force to generate spikes, which is traditionally modeled by a biophysical model with a number of filters and nonlinearities as in Fig. 1(B) [28, 31, 32]. Thus, the retina serves as a typical model system both for deciphering the structure of neuronal circuits [33, 34, 35, 36, 37, 38] and for testing novel methods for neuronal coding [29, 39, 28]. In this study, we use the retina to show what could be learned by CNNs. Unlike previous studies focusing on a large population of retinal ganglion cells [19, 24], here we take a different viewpoint by modeling a single RGC with a CNN as illustrated in Fig. 1(C). In this way, one can use the single cell to reveal the details of the receptive field of visual neurons [26].

Our aim is to study what kind of possible biological structure components in the retina can be learned by CNN, and how one can use CNN for understanding the computations carried out by single retinal cells. These questions concern the research focus of understanding, visualizing and interpreting the CNN components out of its black box.

Like typical existing biophysical RGC models, which are based on interpretable biophysical properties of the receptive field measured experimentally [29, 39], our CNN model is also interpretable. By using a minimal biophysical model of an RGC, we found that the conv filters learned in the CNN are essentially the bipolar subunit components of the RGC model. Furthermore, we applied CNNs to analyze biological RGC data recorded in the salamander. The conv filters resemble the receptive fields of bipolar cells, which sit in the layer preceding the RGC and pool their computations to a single downstream RGC. Such a fine-structure picture of the retinal receptive field revealed by the CNN suggests that the learned convolutional filters resemble biological components of the retinal circuit. Thus a CNN learning from a single retinal cell reveals a minimal neural network carried out in this cell.

In addition, CNN models were trained for different RGCs, and we then transferred these CNNs between the RGCs to test the ability of transfer learning. Transfer learning performance is diverse. In general, the CNN models perform well in transfer learning. However, the best performance is still obtained by the model trained with the cell's own data, which implies that the CNN model is cell-specific, with a set of filters inherited from the target RGC. Furthermore, when CNNs are applied to a different input domain, here white noise images vs. natural images, they can capture the meaningful responses of the RGCs in terms of the proper number of spikes and the right spike timing, even in cases where there is no spike for some specific images. This implies that the CNN indeed captures the full computational ability of one cell for different inputs.

Some preliminary results of this study were presented in a NIPS workshop short communication [40].

II Methods

II-A Biophysical RGC model

A biophysical RGC model as in Fig. 1(B) was implemented as a typical subunit model used previously [41, 28]. The model cell has four subunits, each with a spatial filter of size 2x2 pixels, similar to a conv filter of a CNN but sitting only at a specific spatial location, and a temporal filter to account for temporal dynamics. Each subunit convolves the incoming stimulus image and then applies a threshold-linear rectification. The subunit output signals are then pooled together by the RGC. A threshold-linear output nonlinearity with a positive threshold at unity is applied to the pooled signal to make spiking sparse. Thus, with this model, a given stimulus sequence consisting of white noise images of size 8x8 pixels can generate a train of spikes.
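As a rough illustration of the subunit model described above, the following Python sketch pools four rectified 2x2 subunits and applies the output threshold at unity. The function name, the argument layout, the subunit positions and the temporal kernel passed by the caller are assumptions made for illustration; the text does not specify them, nor how the rectified output is converted into binary spikes.

```python
import numpy as np

def simulate_subunit_rgc(stimulus, spatial_filters, positions,
                         temporal_filter, threshold=1.0):
    """Hypothetical subunit RGC model.

    stimulus:        (T, H, W) sequence of white noise images
    spatial_filters: one 2D filter per subunit (2x2 in the text)
    positions:       top-left (row, col) of each subunit (assumed layout)
    temporal_filter: 1D causal kernel, index 0 is the current frame
    Returns the rectified output signal of length T.
    """
    T = stimulus.shape[0]
    pooled = np.zeros(T)
    for filt, (r, c) in zip(spatial_filters, positions):
        kh, kw = filt.shape
        patch = stimulus[:, r:r + kh, c:c + kw]
        # Spatial filtering at the subunit's fixed location.
        spatial = np.tensordot(patch, filt, axes=([1, 2], [0, 1]))
        # Causal temporal filtering of the subunit signal.
        drive = np.convolve(spatial, temporal_filter, mode="full")[:T]
        # Threshold-linear rectification, then pooling by the RGC.
        pooled += np.maximum(drive, 0.0)
    # Output nonlinearity with a positive threshold at unity.
    return np.maximum(pooled - threshold, 0.0)
```

With the threshold at unity, a blank (all-zero) stimulus produces no output, which is one way to see why spiking becomes sparse.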

II-B Biological RGC data

A public dataset of RGCs recorded in salamander as described in [42, 18, 31] was used for CNN modeling. Briefly, a population of RGC spiking activities was obtained by multielectrode array recordings as in [30]. The retinas were optically stimulated with spatiotemporal white noise images, temporally updated at a rate of 30 Hz and spatially arranged in a checkerboard layout with stimulus pixels of 30x30 . The recording time is about 4 hours so that there are enough stimulus images and spikes for training a CNN model. A dataset of 300 natural images was also used as the stimulus for studying natural image responses [42, 31].

II-C CNN RGC model

We used a naive CNN model containing a different number of convolution layers and a single dense layer. To model biophysical RGC model data, the conv filter size was fixed as . Both the number of filters and layers were tested.

To model biological RGC data, several sets of parameters in convolution layers, including the number of layers, the number, and size of convolution filters were explored. The prediction performance is robust against these changes of parameters. Therefore we adopted the filter size of in the first conv layer and in the second conv layer.
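The architecture described here (a stack of conv layers followed by a single dense layer and a soft-plus output) can be sketched as a plain NumPy forward pass. This is only a minimal sketch: the ReLU activation after each conv layer, the "valid" padding, and all function and argument names are assumptions, and the filter sizes are left to the caller because they are not recoverable from the text.

```python
import numpy as np

def conv_layer(x, filters):
    """Valid cross-correlation followed by ReLU (ReLU is an assumption).

    x:       (C_in, H, W) input feature maps
    filters: (C_out, C_in, kh, kw) filter bank
    """
    kh, kw = filters.shape[2:]
    # windows has shape (C_in, H - kh + 1, W - kw + 1, kh, kw)
    windows = np.lib.stride_tricks.sliding_window_view(x, (kh, kw), axis=(1, 2))
    out = np.einsum("cijkl,ockl->oij", windows, filters)
    return np.maximum(out, 0.0)

def cnn_forward(image, conv_filters, dense_w, dense_b):
    """Map one 2D stimulus image to a non-negative firing rate."""
    x = image[None]                      # add a channel axis
    for filt in conv_filters:            # one filter bank per conv layer
        x = conv_layer(x, filt)
    z = x.ravel() @ dense_w + dense_b    # single dense layer
    return np.logaddexp(0.0, z)          # soft-plus: log(1 + exp(z))
```

For example, an 8x8 image passed through one layer of four 2x2 filters gives four 7x7 feature maps, so `dense_w` would need 4 * 7 * 7 = 196 entries in that hypothetical configuration.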

For training CNN with biophysical RGC model data, we generated a data set consisting of 600k training samples of white noise images, and an additional set of 10k samples for testing. The training labels are a train of binary spikes with 0 and 1 generated by the model.

For biological RGC data recorded in the salamander, there are about 320k training samples of white noise images, with labels given as the number of spikes, in the range [0, 5], for each image. The test data have 300 samples, which were repeatedly presented to the retina for about 200 trials.

The average firing rates of biophysical and biological test data were compared to the CNN output for calculation of the Pearson correlation coefficient (CC) as a performance measure. A Poisson loss is used to optimize the CNN output to match the spiking labels. The final nonlinearity after the dense layer is a standard soft-plus function.
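The two quantities named above can be written out explicitly. The sketch below assumes the common form of the Poisson loss, the mean of rate minus spikes times log rate (the negative log-likelihood up to a term that does not depend on the model), and the standard Pearson correlation coefficient; the function names and the small `eps` guard are illustrative choices, not taken from the paper.

```python
import numpy as np

def poisson_loss(rate, spikes, eps=1e-8):
    """Mean Poisson negative log-likelihood, dropping the log(spikes!) term."""
    rate = np.asarray(rate, dtype=float)
    spikes = np.asarray(spikes, dtype=float)
    return float(np.mean(rate - spikes * np.log(rate + eps)))

def pearson_cc(predicted, observed):
    """Pearson correlation coefficient between two firing-rate traces."""
    a = np.asarray(predicted, dtype=float)
    b = np.asarray(observed, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))
```

The loss is smallest when the predicted rate equals the observed spike count, while the CC is insensitive to the overall scale and offset of the prediction, so the two measures are complementary.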

For preprocessing of data, a standard technique of spike-triggered average was applied to get a 3D spatiotemporal receptive field filter [29]. The singular value decomposition of this 3D filter yields a temporal filter and a spatial receptive field.
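A minimal sketch of this preprocessing step, under the assumption that the spike-triggered average (STA) is taken over a fixed window of `n_lags` frames ending at each spike and that the leading singular vectors are used as the temporal and spatial components. The function names and the window convention are hypothetical; note also that the sign of each singular vector is arbitrary.

```python
import numpy as np

def spike_triggered_average(stimulus, spikes, n_lags):
    """Average the n_lags frames preceding (and including) each spike.

    stimulus: (T, H, W) images; spikes: (T,) spike counts per frame.
    Returns a 3D filter of shape (n_lags, H, W), oldest frame first.
    """
    T = stimulus.shape[0]
    sta = np.zeros((n_lags,) + stimulus.shape[1:])
    total = 0.0
    for t in range(n_lags - 1, T):
        if spikes[t] > 0:
            sta += spikes[t] * stimulus[t - n_lags + 1:t + 1]
            total += spikes[t]
    return sta / max(total, 1.0)

def separate_sta(sta):
    """Rank-1 space-time separation of the 3D STA by SVD."""
    n_lags = sta.shape[0]
    u, s, vt = np.linalg.svd(sta.reshape(n_lags, -1), full_matrices=False)
    temporal = u[:, 0]                       # unit-norm temporal filter
    spatial = vt[0].reshape(sta.shape[1:])   # unit-norm spatial receptive field
    return temporal, spatial
```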


Two versions of CNNs were used. (Version I) The first version of the CNN has only spatial filters to be fitted, without temporal filters. For this, the data were first temporally correlated by convolving every pixel of the whole set of stimulus images with the temporal filter along the temporal dimension [44, 31]. In this way, a sequence of spatial images was obtained as inputs, with the corresponding spike train as output labels for the CNN model, such that this CNN focuses the analysis on the spatial structure of receptive fields. In addition, this CNN has far fewer parameters; for example, when the temporal filter of interest lasts for 600 ms at 30 Hz, the CNN has 20 times fewer parameters. (Version II) The second version of the CNN is the full model with both spatial and temporal filters to be learned from the RGC data. This usually results in a large number of parameters, which causes unavoidable problems for traditional statistical models [39, 28]. Recent advances in deep learning make it relatively easier to fit data with high-dimensional parameters. Both versions of CNNs were used and compared to see the effects on the performance.
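The temporal preprocessing used for Version I amounts to a causal convolution of every pixel's time course with the temporal filter. A minimal sketch, assuming the filter is indexed so that entry k weights the frame k steps in the past (the function name and this indexing convention are assumptions):

```python
import numpy as np

def temporally_filter_stimulus(stimulus, temporal_filter):
    """Convolve each pixel of a (T, H, W) stimulus along time.

    out[t] = sum_k temporal_filter[k] * stimulus[t - k], with frames
    before the start of the recording treated as zero.
    """
    T = stimulus.shape[0]
    out = np.zeros(stimulus.shape, dtype=float)
    for k, w in enumerate(temporal_filter[:T]):
        out[k:] += w * stimulus[:T - k]
    return out
```

Each output frame is then a single spatial image that already carries the stimulus history, so the CNN only has to learn spatial filters.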

When a CNN is used for an object recognition task by learning a set of natural images, it is important to visualize what kinds of features of natural images are learned by the conv filters [4]. Similarly, here one can also visualize the image features represented by each CNN conv filter. The feature here means the response-weighted average feature, which is the spike-triggered average generated by a batch of random noise stimuli and their average activation from the corresponding feature map in the first layer.
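One possible reading of the response-weighted average feature is sketched below: random noise images are passed through a single first-layer filter, the mean activation of the resulting feature map is used as a weight, and the noise images are averaged with those weights. The ReLU on the feature map, the Gaussian noise, the image size and the function name are assumptions made for illustration.

```python
import numpy as np

def response_weighted_feature(conv_filter, n_samples=2000,
                              image_shape=(8, 8), seed=0):
    """Response-weighted average of noise images for one conv filter."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_samples,) + image_shape)
    # All filter-sized windows of every image: (n, H', W', kh, kw).
    windows = np.lib.stride_tricks.sliding_window_view(
        noise, conv_filter.shape, axis=(1, 2))
    # Feature map of the filter on each image, rectified (assumed ReLU).
    fmap = np.maximum(np.einsum("nijkl,kl->nij", windows, conv_filter), 0.0)
    # Average activation of the feature map acts like a spike count.
    weights = fmap.mean(axis=(1, 2))
    return np.tensordot(weights, noise, axes=(0, 0)) / max(weights.sum(), 1e-12)
```

A filter that is close to zero yields near-zero weights and therefore a structureless feature, in line with the description of noisy filters in the Results.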

Fig. 2: Subunit structure of RGC model data revealed by CNN. Visualizing conv filters learned in a CNN with one layer of convolutional filters. The number of conv filters is from 1 to 5, where both spatial and temporal filters are learned by CNN. (Inset) spatial receptive field and temporal filter of the modeled RGC computed by spike-trigger average (left) and CNN (right). Note the filter size is , and the full size of the receptive field is .

Fig. 3: CNN performance saturates when there are more conv filters and layers. Data points represent 10 runs as mean ± standard deviation (STD).

Fig. 4: Subunit structures of biological RGC revealed by CNN. (A) Receptive fields of one RGC computed by STA and CNN prediction. (B) Visualizing CNN model components of conv filters and average features represented by each filter. (C) Neuronal response predicted by CNN visualized by RGC data spike rasters (upper), CNN spike rasters (middle), and their average firing rates (bottom).

Fig. 5: Pruning convolutional filters in CNN. (Left) CNN model performance (CC) maintained at a similar level after pruning CNN by using only a subset of effective (non-zero) filters as in Fig. 4. (Right) CNN performance dropped to zero when pruning CNN with the same number of parameters but a randomly selected subset of filters. Error bars indicate 10 random prunings (mean ± STD).

III Results

By using both a clearly defined biophysical model and real retinal data, we show that a CNN is interpretable when single RGCs are modeled, with the benefit of clarifying what has been learned in the network structure components of the CNN. Recently, a variation of non-negative matrix factorization was used to analyze RGC responses to white noise images and to identify a number of subunits as the bipolar cells of one RGC [31]. With this picture in mind, here we address the question of what types of network structure components can be revealed by a CNN when it is used to model the single RGC response.

III-A Subunits of modeled RGC as CNN filters

We set up a biophysical RGC model with four subunits as in Fig. 1(B), which resembles a 2-layer network with one layer of subunits and one layer consisting of a single RGC. Driven by a set of white noise images, the model generated a sequence of spikes to simulate a minimal neural network of a retinal ganglion cell. With the stimulus images as input and the RGC spikes as output, we can train a CNN as in Fig. 1(C) to predict the simulated spikes generated by the RGC model.

Given that our focus here is on the structure of the network components learned in the CNN, we varied a number of parameters to train the CNN, in particular, the number of conv filters from 1 to 16. When only one conv filter is used, the learned filter has a structure similar to the receptive field of the modeled RGC as in Fig. 2, which can be obtained by the standard method termed spike-triggered average [29] (see Methods). When there are more conv filters, the filters learned by the CNN show a rich zoo of fine structures of the receptive field. We found that when training the CNN with four conv filters, the resulting filters resemble the subunits used in the biophysical RGC model. The filters learned by the CNN are convergent in the sense that there are only four “effective” filters similar to the model subunits, while the rest of the filters are noise, for example when there are five conv filters in Fig. 2. This observation is similar to a recent study where non-negative matrix factorization was used for identifying the subunits [31].

The black-box nature of CNNs often forces researchers to tune the parameters of a CNN according to its performance, which is usually the accuracy on a task such as image classification. Here, the performance is the correlation between the CNN output and the modeled RGC spiking response. Not surprisingly, the CNN predicts the RGC response well, which is consistent with previous studies of various types of visual neurons [16, 12, 24]. More importantly, we also found that once there are enough conv filters, adding more does not improve the performance, as shown for an increasing number of filters in Fig. 3. The performance converges when the number of conv filters reaches 5.

In addition, the number of conv layers affects how the CNN performance reaches its saturation level. There is no difference between two and three layers of CNN conv filters. When there are 2 or more layers of conv filters, the performance is stable regardless of the number of conv filters used. In either case, when only one conv filter is used, the receptive field of the RGC is learned as that one conv filter of the CNN.

Therefore, these results show that increasing the number of conv filters and layers does not increase the performance once enough components are used to capture the underlying biophysical properties of the RGC model. In other words, CNN parameters can be highly redundant when the network is set up with a larger number of them. Such a redundancy of parameters is widely observed in deep learning models [45, 46]. This point is shown in detail with the biological data below.

Altogether, these results suggest that a CNN can identify the underlying hidden network structure components within the RGC model by only looking at the input stimulus images and the output spiking response.

Fig. 6: Effects of training sample size and noise on the performance of CNN. (A) Effect of sample size changing from 1% (0.01) to 100% (1) of the whole training set images on the performance of CNN (CC, upper) and the loss of training (Loss, bottom), where 1% of training data has about 3.2K images and 0.15K spikes (3.05K of non-spiking labels as zero) due to the sparse firing property of RGC. Data point of Ai is shown in (Ai-Aiii). (Ai) CNN output vs. test data firing rate. (Aii) receptive field of data (left) vs. CNN prediction (right). (Aiii) conv filters and features learned by CNN. (B) Similar to (A) but for the effect of noise in sample images, where the ratio of noise changes from 0% (without noise) to 100% (noise only, without data). (Bi-Biii) Similar to (Ai-Aiii).

III-B Subunits of biological RGC as CNN filters

To further characterize the structure components of the CNN in detail, we use CNNs to learn the biological RGC data with similar white noise images and spiking responses. We first use a CNN to study temporally correlated RGC data, where no temporal filter needs to be learned by the CNN (see Version I CNN model in Methods), with the benefit that the CNN has fewer parameters to learn. Similar to the results of the RGC model above, the outputs of the CNN model recover the fine structures of the receptive field of the RGC data very well as in Fig. 4(A). We also found that the learned conv filters converge to a set of localized subunits, whereas the rest of the filters are noisy and close to zero as in Fig. 4(B). The size of these localized filters is comparable to that in bipolar cells around 100 [31].

In addition, the features (see Methods) represented by these localized conv filters are also localized. Given that the example RGC is an OFF-type cell that responds strongly to the dark part of images, most features have similar OFF peaks resulting from the OFF BC-like filters. These OFF features tile the space of the receptive field of the RGC. Interestingly, there are some features with ON peaks, which play a role as inhibition in the retinal circuit. A few features have complex structures with mixed OFF and ON peaks, which mostly result from the less localized filters. However, if the filters are pure noise, the resulting features are pure noise without any embedded structure. Besides filters and features, the CNN model generates a good prediction of the RGC response as in Fig. 4(C). These observations are similar across the different RGCs recorded.

When 32 conv filters are used in the CNN, there are enough conv filters to fit the data. Given that there are many redundant filters (at least 16 filters close to zero in Fig. 4(B)) unused in CNN learning, they are unlikely to reflect any interesting biophysical properties of the RGC data, and so may make no contribution to the performance of the CNN. Indeed, CNN performance remains at a similar level when the conv filters are pruned such that only the effective filters are used in the CNN. The selection of effective filters was quantified by the spatial autocorrelation of conv filters [31]. The pruning results for a population of RGCs show that the performance with a small subset of effective conv filters is similar to that of the full model (Fig. 5). When pruning is done with randomly selected conv filters out of 32, the performance drops to zero as in Fig. 5 (right).
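The text only states that effective filters were selected by their spatial autocorrelation. As one hedged illustration, the sketch below scores each filter with a Moran's-I-style statistic over horizontally and vertically adjacent pixels and keeps filters above a threshold. Both the exact statistic and the threshold value are assumptions and may differ from what was actually used in [31].

```python
import numpy as np

def spatial_autocorrelation(filt):
    """Moran's-I-style score using nearest-neighbour pixel pairs.

    Smooth, localized filters score close to 1; unstructured noise
    scores close to 0; an all-constant filter is defined to score 0.
    """
    z = filt - filt.mean()
    denom = np.sum(z * z)
    if denom == 0:
        return 0.0
    horiz = np.sum(z[:, :-1] * z[:, 1:])
    vert = np.sum(z[:-1, :] * z[1:, :])
    n_pairs = z[:, :-1].size + z[:-1, :].size
    return float((horiz + vert) / n_pairs * z.size / denom)

def select_effective_filters(filters, threshold=0.25):
    """Return indices of filters whose score exceeds the (assumed) threshold."""
    scores = np.array([spatial_autocorrelation(f) for f in filters])
    return np.flatnonzero(scores > threshold), scores
```

Pruning would then keep only the returned indices, while the random-pruning control keeps the same number of filters chosen without regard to the score.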

These results confirm the observation made with the biophysical RGC model, where the CNN performance saturates once there are more than enough parameters. In terms of biological RGCs, the subunits are the upstream bipolar cells [31]. Thus, the conv filters play a functional role as bipolar cells when a CNN is used to model biological RGCs.

Fig. 7: CNN transfer learning across different cells. (Upper) Four example cells used to train the CNN models. Learned CNNs are then transferred across different RGCs for prediction. Over this 4x4 matrix, the diagonal ones are the cases of self-training (Target to Target) with each cell’s data. Cell 1 is the same cell as in Fig. 4. CNN 1 is the model obtained by cell 1. The first row represents that CNN 1 model is transferred to predict the test data of Cell 2-4. The first column represents that three CNN models (CNN 2-4) from the other three cells (Cell 2-4) are transferred to predict the test data of Cell 1. (Bottom) Performance of this 4x4 matrix. The first four points are calculated for cell 1 from CNN 1-4. Different CNNs are colored in different colors.

Next, we examine the effect of sample size and noise on the CNN model. For the same cell shown in Fig. 4, the number of samples/images was varied over a wide range from 1% to 100%; with 1% of the training data, the corresponding number of spikes recorded in the RGC is reduced to about 150. Both CNN performance and loss after training depend on how much data is used for training as in Fig. 6(A). Surprisingly, with only about 150 spikes from 1% of the data, we can still obtain some level of performance, with a CC of about 0.35 as in Fig. 6(Ai). Although the receptive field obtained with such a small number of spikes is not good (Fig. 6(Aii)), the performance of the CNN does not drop that much (CC is 0.35 vs. 0.75, but with 150 vs. 15K spikes). Thus, the CNN seems to need only a relatively small set of spikes for training to reach a reasonable performance. Note that with 30% of the data (4.5K spikes), the performance is almost the same as with the full data. In this sense, the CNN seems to be much less data-demanding than traditional biophysical RGC models [28, 39]. The resulting filters and features of the CNN are also worse; however, the sparseness of filters still holds although 32 filters are used.

In contrast to the sample size, noise has a much larger effect on the CNN model as in Fig. 6(B). Keeping the sample size unchanged, we replaced part of the data images with irrelevant noise images; for instance, 70% of noise means there are 30% data images and 70% noise images. Although the data percentage is the same, 30%, the CNN trained with noise performs much worse (CC 0.4) compared with the same amount of data without noise (CC 0.7). Fig. 6(Bi-Biii) shows an example result of training the CNN with complete noise images. Even in this case, the redundancy of filters makes most filters close to zero.
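The two manipulations compared in this section, reducing the training set and replacing part of it with irrelevant noise images while keeping the labels, could be set up as follows. The function names, the uniform random selection and the use of fresh Gaussian noise as the "irrelevant" images are assumptions for illustration.

```python
import numpy as np

def subsample(stimulus, spikes, fraction, rng):
    """Keep a random fraction of (image, label) pairs, in temporal order."""
    n = max(1, int(round(fraction * len(stimulus))))
    idx = np.sort(rng.choice(len(stimulus), size=n, replace=False))
    return stimulus[idx], spikes[idx]

def replace_with_noise(stimulus, noise_ratio, rng):
    """Replace a fraction of images with unrelated noise, labels untouched.

    Returns the corrupted stimulus and the indices that were replaced.
    """
    out = stimulus.astype(float, copy=True)
    n_noise = int(round(noise_ratio * len(stimulus)))
    idx = rng.choice(len(stimulus), size=n_noise, replace=False)
    out[idx] = rng.standard_normal((n_noise,) + stimulus.shape[1:])
    return out, idx
```

In the 30% example from the text, `subsample(..., 0.3, rng)` gives a smaller but clean training set, whereas `replace_with_noise(..., 0.7, rng)` keeps the full size but leaves only 30% of the images paired with the labels they actually evoked.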

Fig. 8: CNN transfer prediction for a population of 10 RGCs. (A) Performance matrix of 10 CNNs across different RGCs, where the diagonal ones are self-training (Target to Target, purple), each row is transferring the CNN trained by target cell to all other cells (Target to Others, fix-CNN, green), and each column is transferring all other CNNs to the target cell (Others to Target, fix-data, light blue). The first four cells are the same ones as in Fig. 7. The rows and columns of performance matrix have the same meaning as in Fig. 7. (B) Performance matrix shown as a scatter plot of self-training v.s. fix-CNN. (C) Performance matrix shown as a scatter plot of self-training v.s. fix-data.

Fig. 9: Spatial and temporal filters of biological RGC data revealed by full CNN model. (A) Receptive fields as spatial STA of the example cell and CNN prediction. Temporal filters are also recovered by CNN. (B) Visualizing CNN model components of both conv filters and average features represented by each filter in both spatial and temporal dimension. (C) Neuronal response predicted by CNN visualized by RGC data spike rasters (upper), CNN spike rasters (middle), and their firing rates. (D-F) Transfer prediction by full CNN models for a population of 10 RGCs. All plots have the same meanings as Fig. 8. (G) CNN prediction improved with temporal filter included, but transfer prediction is worse in general. Performance (CC) matrix shown as a scatter plot of CNN without temporal filter (Fig. 8) vs. full CNN with temporal filter. Black indicates transfer prediction, red indicates target prediction.

III-C Transfer learning across different RGCs

Given that there is a population of RGCs recorded experimentally and then modeled by CNNs, one can study the behavior of transfer learning, or the generalization ability of the CNN model, i.e., using the CNN model learned from one RGC to predict the response of another RGC. Several scenarios of transfer learning are commonly used in deep learning, including using a pre-trained model directly, fine-tuning a pre-trained model, or fixing the features of a pre-trained model but adjusting the dense layer [47]. Here we took the approach of applying a CNN pre-trained on one RGC directly to other RGCs. This is suitable for our experimental setup, as large full-size white noise images were presented to all RGCs of a population at the same time. Different RGCs sit at different spatial locations of the images, and therefore see different parts of the whole image. However, due to the nature of white noise images, the statistics of the ensemble of input images are the same, or at least closely similar to Gaussian, across different RGCs. Thus, one expects that transfer learning of the CNN model, in this case, performs well.
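The direct-transfer protocol described above can be summarized as a performance matrix in which entry (i, j) is the CC obtained when the CNN trained on cell i, without any retraining, predicts the test response of cell j from the stimulus patch seen by cell j. In this sketch the trained CNNs are represented by arbitrary callables, and the function and argument names are hypothetical.

```python
import numpy as np

def transfer_matrix(models, stimuli, rates):
    """Pearson CC for every (model, cell) pair.

    models:  list of callables, models[i](stimulus) -> predicted rate
    stimuli: stimuli[j] is the test stimulus as seen by cell j
    rates:   rates[j] is the measured average firing rate of cell j
    Row i holds the model of cell i applied to all cells; the diagonal
    is the self-trained (Target to Target) case.
    """
    n = len(models)
    cc = np.zeros((n, n))
    for i, model in enumerate(models):
        for j in range(n):
            cc[i, j] = np.corrcoef(model(stimuli[j]), rates[j])[0, 1]
    return cc
```

Reading the matrix along a row corresponds to fixing the CNN and varying the cell, and reading along a column corresponds to fixing the cell's data and varying the CNN.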

However, we found a large diversity of transfer learning performance across different RGCs as shown in Fig. 7, where four example cells are shown with their CNN model predictions (diagonal traces labeled as “Self”) and the corresponding transfer learning behaviors (off-diagonal traces). The CNN performance characterized by CC shows the tendency that the best performance always comes from the CNN model trained with the cell's own data. For instance, cell 1 has a good test performance with its own CNN model, denoted as CNN 1. When CNN 1 is transferred to the other three cells, the prediction power is lower than for cell 1. However, CNN 3 transfers to cell 1 with a better performance of CC=0.64 compared with CC=0.56 for cell 3 itself. Interestingly, all CNN models have a reasonably good performance for cell 1, even though CNN 4 has a fairly good performance for cell 4 itself.

To further investigate the transfer learning ability of the CNN model, we collected a population of 10 RGCs (including the four cells in Fig. 7). The population plots in Fig. 8 confirm the observation above. The scatter plot of self-learning (Target to Target) vs. transfer learning (Target to Others) in Fig. 8 (left) shows that the CNN model has a reasonably good performance for both self-learning and transfer learning, yet the results are quite diverse. Note that even for the worst cell, with the lowest CC in self-learning, its CNN has a better performance when transferred to other cells.

Similarly, as above, the performance is always lower when other CNNs are transferred to the target cells themselves (Fig. 8 (right)). This indicates that each CNN model is really optimized for the target cell after training. For the best cell, which has the highest CC with its own CNN, other CNNs trained with other cells also perform well in general. In contrast, for the worst cell, which has the lowest CC with its own CNN, other CNNs trained with other cells also perform badly.

These results suggest that the CNN learned from each cell is very specific to that particular cell. The resulting CNN, therefore, learns to obtain a minimal neural network that carries out the essential computations done by that cell.

Fig. 10: Transfer prediction of natural images by CNN trained with white noise images. (A) Performance of CNN for the example RGC (green cell in (B)) with 50 images (one image per second). (B) Performance of CNN with four example images and three RGCs. (top) five images overlaid with the outlines of the receptive field of three RGCs colored in green, light blue and yellow. (bottom) spike response of each GC for each image (blue) together with spikes sampled from CNN output (red). Each image was presented for 200 ms (colored shadow window), followed by an 800-ms gray period. Note that the RGC response is delayed after the onset of an image. Each image triggers one RGC in a different manner, with spiking or non-spiking depending on the texture of the image and the specific RGC.

III-D Full CNN model for biological RGC data

The results above on biological RGC data were obtained with a CNN in which no temporal filter was learned. Now we consider the full CNN model (Version II CNN model, see Methods), where both temporal and spatial filters need to be learned. Similar to the RGC model data, the CNN can recover both the spatial filters, as the receptive field, and the temporal filters, as shown in Fig. 9(A). The full CNN model has 20 times more parameters than the model for temporally correlated data. Both the conv filters and the features visualized in the CNN still have a localized spatial structure and a good temporal filter shape for a subset of filters. Not surprisingly, the CNN also predicts the neuronal response well.

Similar to Fig. 8, we also tested the transfer learning ability of the full CNN model for the same population of 10 RGCs. The results in Fig. 9 show a similar tendency of transfer learning for the full CNN model. Yet there are some differences between the two versions of the CNN model, which can be seen in the comparison in Fig. 9. For the target case, where each CNN was trained on the data of that particular RGC, the full CNN yields better performance than the CNN without a learned temporal filter. However, when full CNNs are transferred between different cells, their performance is in general worse than that of the reduced CNNs. This indicates that the full CNN with both spatial and temporal filters is more specific to the particular cell used for training. In turn, such a cell-specific CNN cannot be used to explain other cells. This supports the result that the CNN indeed learns the underlying neuronal computation carried out by the biological cell, which cannot be transferred across different cells.

Given that there are 20 times more parameters in the full CNN, its conv filters have less clearly localized structures than those in the CNN for reduced temporal correlations, as seen by comparing the filters in Fig. 4 and Fig. 9. This seems to be caused by the limited sample size of the biological RGC data: when the biophysical RGC model is used for both versions of the CNN, there is no difference in the structure of the conv filters (see the results in [40]). Therefore, depending on the question to be addressed, one may choose the simple or the full version of the CNN for modeling static images or dynamical movies, respectively, as movies contain strong temporal correlations that have a strong impact on the adaptation of neuronal dynamics [30].
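This section does not spell out where the factor of 20 comes from. One way such a factor can arise, assuming the spatio-temporal model receives a stack of stimulus frames as input channels of its first convolutional layer, is sketched below. The kernel size, filter count, and frame count are hypothetical values chosen only to make the arithmetic concrete; they are not taken from the paper.

```python
def conv2d_param_count(kernel_size, in_channels, out_channels, bias=True):
    """Number of trainable parameters in one 2-D convolutional layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Hypothetical sizes, chosen only for illustration.
KERNEL = 7      # spatial kernel width in pixels
N_FILTERS = 8   # number of conv filters
N_FRAMES = 20   # stimulus frames fed to the spatio-temporal model

# Spatial-only model: one frame in. Spatio-temporal model: N_FRAMES in.
spatial_only = conv2d_param_count(KERNEL, 1, N_FILTERS, bias=False)
spatio_temporal = conv2d_param_count(KERNEL, N_FRAMES, N_FILTERS, bias=False)
ratio = spatio_temporal / spatial_only
print(spatial_only, spatio_temporal, ratio)
```

Under these assumptions the weight count of the first layer scales linearly with the number of frames, so the same amount of recorded data has to constrain many more weights. This is consistent with the less clearly localized filters reported for the full model.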

III-E Transfer learning between different stimulus images

Above we tested transfer learning of CNNs across different cells. Here we further test transfer learning of the CNN of each cell across different input images. Since the CNN model of each RGC has so far been trained with white noise images, one can test its ability of transfer prediction by directly applying the learned CNN to natural scenes.

For this, a sequence of 300 natural images was presented to the retina, where each image was presented briefly for 200 ms and followed by an 800-ms empty scene. This stimulation protocol is meant to leave out temporal adaptation to each image (see details in [42]). Therefore, we used the temporally decorrelated CNN model (Version I) to fit the neuronal response to this sequence of flashed images.
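With one image per second (200 ms of image followed by 800 ms of blank), the response to each image can be reduced to a spike count in a window after image onset. The following is a minimal sketch of that bookkeeping; the function name `spike_counts_per_image`, the default window, and the spike times are illustrative assumptions and do not reproduce the analysis in [42].

```python
from bisect import bisect_left


def spike_counts_per_image(spike_times, n_images, period=1.0, window=(0.0, 1.0)):
    """Count spikes of one cell in a window following each image onset.

    spike_times : spike times in seconds, relative to the first image onset.
    period      : time between image onsets (0.2 s image + 0.8 s blank).
    window      : (start, stop) offsets from onset; spikes in [start, stop)
                  are assigned to that image (assumed window).
    """
    spikes = sorted(spike_times)
    counts = []
    for k in range(n_images):
        onset = k * period
        start, stop = onset + window[0], onset + window[1]
        counts.append(bisect_left(spikes, stop) - bisect_left(spikes, start))
    return counts


# Made-up spike times for four consecutive images.
spike_times = [0.08, 0.12, 0.15, 1.30, 2.05, 2.09, 2.11, 2.40]
counts = spike_counts_per_image(spike_times, n_images=4)
print(counts)  # [3, 1, 4, 0]
```

A per-image count of this kind is one simple target against which the output of a model without temporal filters can be compared.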

Surprisingly, the CNN trained with white noise images can predict the spiking response to natural images quite well. Fig. 10 (A) shows the result for one example cell. For clarity, we show only a partial sequence of 50 images. The detailed result is demonstrated with five images and three cells in Fig. 10 (B). Each RGC reads a different part of the images, as its receptive field is located differently. Interestingly, the CNN predicts the responses well when all three cells fire for image 1; in addition, the CNN can also predict when one of the three cells does not fire for images 2-4. Even when none of the three cells fires at all for image 5, the outputs of the CNN are also silent.

These results of transfer learning across image domains, compared to those across different RGCs, suggest that the CNN tends to learn a cell-specific neural network in which convolutional filters play the role of upstream subunit cells that connect to a particular RGC. Thus, the CNN serves as a model of neural system identification to reveal the underlying computations of retinal ganglion cells.

IV Summary & Discussions

In recent years, system identification of neural coding based on neural networks has been greatly improved by the development of deep convolutional neural networks [1, 12, 11], whose significant characteristics are multiple layers and huge banks of local subunit filters. Besides their powerful performance on many practical tasks [1], their underlying structure mimics the hierarchical organization of the brain [12, 11]. However, there is no first principle for designing the structure of a hierarchical deep learning network [2]. Recent works have begun to look into the details of network structure, in particular its potential connections to the biological brain [12, 3].

Here, by focusing on single retinal ganglion cells, we found that a CNN can learn its parameters in an interpretable fashion, and that the CNN network components are close to the biological underpinnings of the retinal circuit. Given the relatively well-understood retinal circuit, our results suggest that the building blocks of CNNs are meaningful when they are applied to neuroscience for revealing network structure components.

Deep CNNs are useful for modeling the abstract level of visual information [11, 12] and neural coding in general [48] in neuroscience. Our current study simplifies the approach used by previous studies [14, 12, 11] of the higher part of the visual cortex, where interpreting the structure components of a deep CNN is difficult due to the complex hierarchical layers from the retina to IT cortex in the brain. By training CNNs with white noise images, our current work also simplifies the interpretation compared to studies in which natural images were used to model visual neurons with CNNs [26, 16, 22], since white noise images have the benefit of mapping out the receptive field of visual neurons [29, 31].
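The benefit of white noise for receptive-field mapping can be illustrated with the classical spike-triggered average (STA), the standard white-noise analysis of the kind described in [29]. The sketch below uses NumPy and a simulated cell; the single-pixel filter, the rectifying nonlinearity, and the Poisson spike generation are assumptions for this demo and are unrelated to the recorded RGCs or to the CNN used in this work.

```python
import numpy as np


def spike_triggered_average(stimulus, spike_counts):
    """Spike-weighted mean of the stimulus frames.

    stimulus     : array of shape (n_frames, height, width), zero-mean noise.
    spike_counts : array of shape (n_frames,), spikes evoked by each frame.
    """
    stimulus = np.asarray(stimulus, dtype=float)
    spike_counts = np.asarray(spike_counts, dtype=float)
    total = spike_counts.sum()
    if total == 0:
        raise ValueError("no spikes: the STA is undefined")
    return np.tensordot(spike_counts, stimulus, axes=(0, 0)) / total


rng = np.random.default_rng(1)
n_frames, size = 20000, 8
stimulus = rng.standard_normal((n_frames, size, size))  # Gaussian white noise

# Simulated cell (assumption): one-pixel linear filter, rectification,
# then Poisson spike counts.
true_rf = np.zeros((size, size))
true_rf[3, 4] = 1.0
drive = np.tensordot(stimulus, true_rf, axes=([1, 2], [0, 1]))
spike_counts = rng.poisson(np.maximum(drive, 0.0))

sta = spike_triggered_average(stimulus, spike_counts)
peak = np.unravel_index(np.argmax(sta), sta.shape)
print(peak)
```

Because the white noise pixels are independent and zero-mean, pixels outside the receptive field average out in the STA, while the pixel that drives the cell stands out. With natural images, pixel correlations would blur this estimate unless they were corrected for.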

Unlike most recent studies, which are driven by the goal of achieving better performance in predicting neural responses by using feedforward or recurrent neural networks (or both), here we focus on the fine structure of the receptive field in the retinal circuit. In line with several recent papers [24, 26, 49], characterizing the receptive field of visual neurons is important for understanding the filters learned by the CNN. Given that the retina has a relatively clear and simple circuit, and that the eye receives (almost) no feedback connections from the cortex, it is a suitable model system for a feedforward neural network, similar in principle to a CNN. Certainly, the contributions of inhibitory neurons, such as horizontal cells and amacrine cells, play a role in the function of the retina. In this sense, neural networks with lateral inhibition and/or recurrent units are desirable [17, 16].

Our approach is suitable for addressing other difficult issues of deep learning, such as transfer learning, since the domain of images seen by single RGCs is local and less complicated than the global structure of entire natural images. In the first case, it is surprising that transfer learning across different RGCs is not perfect, given that the stimulus distribution is the same: the ensembles of white noise image stimuli for all RGCs converge to a naive Gaussian distribution. On further thought, however, this indicates that the CNN learned from each cell is rather specific, with a particular set of filters in the convolutional layers. In the current work, we only explore the filters of the CNN. Future work is needed to investigate other components of the CNN, such as nonlinearities.

Here we also test the ability of transfer learning across different types of stimuli, i.e., transfer from white noise images to natural images. Traditionally, dynamical movies have been used in CNN studies where the neuronal signals of interest are fMRI data [50], whereas for neuronal spiking responses a sequence of flashed images is often used [12, 16, 21]. How to use CNNs for neuronal spikes to study dynamical visual scenes, i.e., continuous movies [51], still needs to be considered. Future work is needed in this direction of studying retinal computation under dynamical natural scenes.

The CNN modeled here for RGCs can be further used for the reconstruction of natural images based on the responses of RGCs [52, 53]. Embedding the encoder/decoder into a retinal prosthesis has been suggested as a promising direction for visual restoration [54]. Such an approach of studying the encoding and decoding of visual scenes with neural spikes will be crucial for the next generation of neuromorphic computing, including artificial visual systems [53], where the data processed on chips are in the format of digital spikes [55]. One expects that the close interaction of algorithms based on spike data and CNNs [56, 57, 58] with neuromorphic chips [59, 60] will greatly expand our computing capacity.


We thank other members of the National Engineering Laboratory for Video Technology for helpful discussions.


  • [1] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [2] L. N. Smith and N. Topin, “Deep convolutional neural network design patterns,” arXiv preprint arXiv:1611.00847, 2016.
  • [3] A. H. Marblestone, G. Wayne, and K. P. Kording, “Toward an integration of deep learning and neuroscience,” Frontiers in Computational Neuroscience, vol. 10, p. 94, sep 2016.
  • [4] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision, 2014, pp. 818–833.
  • [5] D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick, “Neuroscience-inspired artificial intelligence,” Neuron, vol. 95, no. 2, pp. 245–258, jul 2017.
  • [6] Y. Lecun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in vision,” in IEEE International Symposium on Circuits and Systems, 2010, pp. 253–256.
  • [7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.
  • [9] E. P. Simoncelli and B. A. Olshausen, “Natural image statistics and neural representation,” Annual Review of Neuroscience, vol. 24, no. 24, p. 1193, 2001.
  • [10] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in International Conference on Computer Vision, 2011, pp. 2018–2025.
  • [11] N. Kriegeskorte, “Deep neural networks: A new framework for modeling biological vision and brain information processing,” Annual Review of Vision Science, vol. 1, no. 1, pp. 417–446, nov 2015.
  • [12] D. L. K. Yamins and J. J. Dicarlo, “Using goal-driven deep learning models to understand sensory cortex,” Nature Neuroscience, vol. 19, no. 3, p. 356, 2016.
  • [13] D. Yamins, H. Hong, C. Cadieu, and J. J. Dicarlo, “Hierarchical modular optimization of convolutional networks achieves representations similar to macaque it and human ventral stream,” Advances in Neural Information Processing Systems, pp. 3093–3101, 2013.
  • [14] D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo, “Performance-optimized hierarchical models predict neural responses in higher visual cortex,” Proceedings of the National Academy of Sciences, vol. 111, no. 23, pp. 8619–8624, may 2014.
  • [15] S.-M. Khaligh-Razavi and N. Kriegeskorte, “Deep supervised, but not unsupervised, models may explain IT cortical representation,” PLoS Computational Biology, vol. 10, no. 11, p. e1003915, nov 2014.
  • [16] L. McIntosh, N. Maheswaranathan, A. Nayebi, S. Ganguli, and S. Baccus, “Deep learning models of the retinal response to natural scenes,” in Advances in neural information processing systems, 2016, pp. 1369–1377.
  • [17] E. Batty, J. Merel, N. Brackbill, A. Heitman, A. Sher, A. Litke, E. J. Chichilnisky, and L. Paninski, “Multilayer recurrent network models of primate retinal ganglion cell responses.” in 5th International Conference on Learning Representations, 2017.
  • [18] P. J. Vance, G. P. Das, D. Kerr, S. A. Coleman, T. M. McGinnity, T. Gollisch, and J. K. Liu, “Bioinspired approach to modeling retinal ganglion cells using system identification techniques,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1796–1808, 2018.
  • [19] N. Maheswaranathan, L. T. McIntosh, D. B. Kastner, J. Melander, L. Brezovec, A. Nayebi, J. Wang, S. Ganguli, and S. A. Baccus, “Deep learning models reveal internal structure and diverse computations in the retina under natural scenes,” bioRxiv, p. 340943, 2018.
  • [20] B. Vintch, J. A. Movshon, and E. P. Simoncelli, “A convolutional subunit model for neuronal responses in macaque v1,” Journal of Neuroscience, vol. 35, no. 44, pp. 14 829–14 841, 2015.
  • [21] J. Antolík, S. B. Hofer, J. A. Bednar, and T. D. Mrsic-Flogel, “Model constrained by visual hierarchy improves prediction of neural responses to natural scenes,” PLoS Computational Biology, vol. 12, no. 6, p. e1004927, jun 2016.
  • [22] W. F. Kindel, E. D. Christensen, and J. Zylberberg, “Using deep learning to reveal the neural code for images in primary visual cortex,” arXiv preprint arXiv:1706.06208, 2017.
  • [23] S. A. Cadena, G. H. Denfield, E. Y. Walker, L. A. Gatys, A. S. Tolias, M. Bethge, and A. S. Ecker, “Deep convolutional models improve predictions of macaque v1 responses to natural images,” bioRxiv, p. 201764, 2017.
  • [24] D. Klindt, A. S. Ecker, T. Euler, and M. Bethge, “Neural system identification for large populations separating “what” and “where”,” in Advances in Neural Information Processing Systems.   Curran Associates, Inc., 2017, pp. 3509–3519.
  • [25] M. R. Whiteway, K. Socha, V. Bonin, and D. A. Butts, “Characterizing the nonlinear structure of shared variability in cortical neuron populations using neural networks,” bioRxiv, p. 407858, 2018.
  • [26] J. Ukita, T. Yoshida, and K. Ohki, “Characterization of nonlinear receptive fields of visual neurons by convolutional neural network,” bioRxiv, p. 348060, 2018.
  • [27] R. J. Rowekamp and T. O. Sharpee, “Cross-orientation suppression in visual area v2,” Nature communications, vol. 8, p. 15739, 2017.
  • [28] J. M. McFarland, Y. Cui, and D. A. Butts, “Inferring nonlinear neuronal computation based on physiologically plausible inputs,” PLoS Computational Biology, vol. 9, no. 7, p. e1003143, jul 2013.
  • [29] E. J. Chichilnisky, “A simple white noise analysis of neuronal light responses,” Network, vol. 12, no. 2, pp. 199–213, 2001.
  • [30] J. K. Liu and T. Gollisch, “Spike-triggered covariance analysis reveals phenomenological diversity of contrast adaptation in the retina,” PLoS Computational Biology, vol. 11, no. 7, p. e1004425, jul 2015.
  • [31] J. K. Liu, H. M. Schreyer, A. Onken, F. Rozenblit, M. H. Khani, V. Krishnamoorthy, S. Panzeri, and T. Gollisch, “Inference of neuronal functional circuitry with spike-triggered non-negative matrix factorization,” Nature Communications, vol. 8, no. 1, p. 149, jul 2017.
  • [32] S. Jia, Z. Yu, A. Onken, Y. Tian, T. Huang, and J. K. Liu, “Characterizing neuronal circuits with spike-triggered non-negative matrix factorization,” arXiv preprint arXiv:1808.03958, 2018.
  • [33] M. Helmstaedter, K. L. Briggman, S. C. Turaga, V. Jain, H. S. Seung, and W. Denk, “Connectomic reconstruction of the inner plexiform layer in the mouse retina,” Nature, vol. 500, no. 7461, pp. 168–174, 2013.
  • [34] H. Zeng and J. R. Sanes, “Neuronal cell-type classification: challenges, opportunities and the path forward,” Nature Reviews Neuroscience, vol. 18, no. 9, p. 530, 2017.
  • [35] R. E. Marc, B. W. Jones, C. B. Watt, J. R. Anderson, C. Sigulinsky, and S. Lauritzen, “Retinal connectomics: towards complete, accurate networks,” Progress in Retinal and Eye Research, vol. 37, pp. 141–162, 2013.
  • [36] H. S. Seung and U. Sümbül, “Neuronal cell types and connectivity: lessons from the retina,” Neuron, vol. 83, no. 6, pp. 1262–1272, 2014.
  • [37] J. R. Sanes and R. H. Masland, “The types of retinal ganglion cells: current status and implications for neuronal classification,” Annual Review of Neuroscience, vol. 38, pp. 221–246, 2015.
  • [38] J. B. Demb and J. H. Singer, “Functional circuitry of the retina,” Annual Review of Vision Science, vol. 1, pp. 263–289, 2015.
  • [39] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli, “Spatio-temporal correlations and visual signalling in a complete neuronal population,” Nature, vol. 454, no. 7207, p. 995, 2008.
  • [40] Q. Yan, Z. Yu, F. Chen, and J. K. Liu, “Revealing structure components of the retina by deep learning networks,” arXiv preprint arXiv:1711.02837, 2017.
  • [41] T. Gollisch and M. Meister, “Rapid neural coding in the retina with relative spike latencies,” Science, vol. 319, no. 5866, pp. 1108–11, 2008.
  • [42] A. Onken, J. K. Liu, P. C. R. Karunasekara, I. Delis, T. Gollisch, and S. Panzeri, “Using matrix and tensor factorizations for the single-trial analysis of population spike trains,” PLoS Computational Biology, vol. 12, no. 11, p. e1005189, nov 2016.
  • [43] J. L. Gauthier, G. D. Field, A. Sher, M. Greschner, J. Shlens, A. M. Litke, and E. Chichilnisky, “Receptive fields in primate retina are coordinated to sample visual space more uniformly,” PLoS Biology, vol. 7, no. 4, p. e1000063, apr 2009.
  • [44] J. Kaardal, J. D. Fitzgerald, M. J. Berry II, and T. O. Sharpee, “Identifying functional bases for multidimensional neural computations,” Neural Computation, vol. 25, no. 7, pp. 1870–90, 2013.
  • [45] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2013, pp. 2148–2156.
  • [46] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” Fiber, vol. 56, no. 4, pp. 3–7, 2015.
  • [47] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2014, pp. 3320–3328.
  • [48] J. I. Glaser, R. H. Chowdhury, M. G. Perich, L. E. Miller, and K. P. Kording, “Machine learning for neural decoding,” arXiv preprint arXiv:1708.00909, 2017.
  • [49] A. S. Ecker, F. H. Sinz, E. Froudarakis, P. G. Fahey, S. A. Cadena, E. Y. Walker, E. Cobos, J. Reimer, A. S. Tolias, and M. Bethge, “A rotation-equivariant convolutional neural network model of primary visual cortex,” arXiv preprint arXiv:1809.10504, 2018.
  • [50] H. Wen, J. Shi, Y. Zhang, K.-H. Lu, J. Cao, and Z. Liu, “Neural encoding and decoding with deep learning for dynamic natural vision,” Cerebral Cortex, pp. 1–25, 2017.
  • [51] T. Naselaris, R. J. Prenger, K. N. Kay, M. Oliver, and J. L. Gallant, “Bayesian reconstruction of natural images from human brain activity,” Neuron, vol. 63, no. 6, pp. 902–915, 2009.
  • [52] N. Parthasarathy, E. Batty, W. Falcon, T. Rutten, M. Rajpal, E. Chichilnisky, and L. Paninski, “Neural networks for efficient bayesian decoding of natural images from retinal neurons,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds.   Curran Associates, Inc., 2017, pp. 6437–6448.
  • [53] Z. Yu, J. K. Liu, S. Jia, Y. Zhang, Y. Zheng, Y. Tian, and T. Huang, “Towards the next generation of retinal neuroprosthesis: Visual computation with spikes,” Engineering.
  • [54] S. Nirenberg and C. Pandarinath, “Retinal prosthetic strategy with the capacity to restore normal vision,” Proceedings of the National Academy of Sciences, vol. 109, no. 37, pp. 15 012–15 017, aug 2012.
  • [55] S. Dong, T. Huang, and Y. Tian, “Spike camera and its coding methods,” in Data Compression Conference (DCC), 2017.   IEEE, 2017, pp. 437–437.
  • [56] Y. Hu, H. Tang, Y. Wang, and G. Pan, “Spiking deep residual network,” arXiv preprint arXiv:1805.01352, 2018.
  • [57] Q. Xu, Y. Qi, H. Yu, J. Shen, H. Tang, and G. Pan, “Csnn: An augmented spiking based framework with perceptron-inception.” in IJCAI, 2018, pp. 1646–1652.
  • [58] R. Xiao, H. Tang, P. Gu, and X. Xu, “Spike-based encoding and learning of spectrum features for robust sound recognition,” Neurocomputing, vol. 313, pp. 65–73, 2018.
  • [59] S. Esser, P. Merolla, J. Arthur, A. Cassidy, R. Appuswamy, A. Andreopoulos, D. Berg, J. McKinstry, T. Melano, D. Barch et al., “Convolutional networks for fast, energy-efficient neuromorphic computing.” Proceedings of the National Academy of Sciences, vol. 113, no. 41, pp. 11 441–11 446, 2016.
  • [60] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain et al., “Loihi: A neuromorphic manycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.