Our ability to see arises from the activity evoked in our brains as we view the world around us. Ever since Hubel and Wiesel mapped the flow of visual information from the retina to the thalamus and then to the cortex, understanding how these regions encode and process visual information has been a major focus of visual systems neuroscience. In the first cortical stage of visual processing, primary visual cortex (V1), Hubel and Wiesel identified neurons that respond to oriented edges within image stimuli. These are called simple or complex cells, depending on how sensitive the neurons' responses are to shifts in the position of the edge hubel1959. Simple and complex cells are well studied lehky1992; david2004; montijn2016. However, many V1 neurons are neither simple nor complex cells, and the classical models of simple and complex cells often fail to predict how those neurons respond to naturalistic stimuli olshausen2005. Thus, much of how V1 encodes visual information remains unknown. We use deep learning to address this longstanding problem.
Recent advances in neural recording technology and machine learning have put solving the V1 neural code within reach. Experimental technology for simultaneously recording from large populations of neurons, such as multielectrode arrays, has opened the door to studying how the collective behavior of neurons encodes sensory information. Moreover, machine learning methods inspired by the anatomy of the mammalian visual system, known as convolutional neural networks, have achieved impressive success on increasingly difficult image classification tasks krizhevsky2012; lecun2015. Recently, these artificial neural networks have been used to study the visual system Yamins2016, setting the state of the art for predicting stimulus-evoked neural activity in the retina mcintosh2016 and inferior temporal cortex (IT) yamins2014. Despite these successes, we have not yet achieved a full understanding of how V1 represents natural images.
In this work, we present a convolutional neural network that predicts the V1 activity patterns evoked by natural image stimuli. We use this network to predict the activity of 355 individual neurons in macaque monkey V1; in doing so, the network captures the neural visual code for many neurons regardless of cell type. On held-out validation data, the network predicts firing rates that are highly correlated with the neurons' actual firing rates. For 15% of these neurons, the firing rates are predicted to within 10% of the theoretical limit set by the trial-to-trial variation in the neural responses. To advance our understanding of the visual processing that takes place in V1, we invert the network to identify visual features that cause individual cells to spike. In the process, we identify novel functional cell types in monkey V1.
2.1 Experimental data
We use publicly available multielectrode recordings from macaque V1 coen2015. In these experiments, macaque monkeys were anesthetized and presented with a series of images while the experimenters simultaneously recorded the spiking activity of a population of V1 neurons (Fig. 1A,B) with a multielectrode array. These recordings were conducted in 10 experimental sessions with 3 different animals, yielding a total of 392 well-isolated neurons. A full description of the data and experimental methods can be found in Coen-Cagli et al. coen2015. We use 37 of these neurons, from one session, to determine how we construct our network (its hyper-parameters), and the remaining 355 neurons to evaluate its performance. For each neuron $j$, we calculate the mean firing rate evoked by each image $i$ by averaging its firing rate across the 20 repeated presentations of that image. The firing rates are calculated over a window from 50 to 100 ms after image onset, to account for the signal propagation delay from the retina to V1.
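The rate computation above can be sketched as follows (a minimal numpy illustration; the 50 to 100 ms window is from the text, but the spike-time format and toy numbers are hypothetical):

```python
import numpy as np

def mean_firing_rate(spike_times, t_start=0.050, t_stop=0.100):
    """Mean firing rate (Hz) of one neuron for one image, averaged over trials.

    spike_times: list of arrays, one per repeated presentation, holding spike
    times in seconds relative to image onset. The 50-100 ms window accounts
    for the retina-to-V1 propagation delay.
    """
    window = t_stop - t_start
    counts = [np.sum((t >= t_start) & (t < t_stop)) for t in spike_times]
    return np.mean(counts) / window

# Toy example: 3 trials of one image (illustrative numbers, not real data).
trials = [np.array([0.055, 0.060, 0.090]),
          np.array([0.070]),
          np.array([0.020, 0.080])]
rate = mean_firing_rate(trials)  # (3 + 1 + 1) spikes / 3 trials / 0.05 s
```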
We analyze the responses to 270 natural images that are circularly cropped to the size of the retinal visual field (Fig. 1B). The full dataset contains responses to natural and artificial stimuli, both full-sized and cropped. We use only the natural images because we are interested in the real-world behavior of the visual system, and only the cropped images because they span the same visual field as the grating stimuli that we use to characterize the neurons as either orientation selective or not. Prior to training the neural network, we downsample the images using a non-overlapping 2×2 window and crop them to a size of 33×33 pixels.
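The preprocessing can be sketched as follows (a minimal numpy illustration; whether the 2×2 window averages or subsamples is our assumption, as is the center crop):

```python
import numpy as np

def preprocess(image, out_size=33):
    """Downsample with a non-overlapping 2x2 window (here: mean), then center-crop."""
    h, w = image.shape
    # Fold each non-overlapping 2x2 block into its own axis pair, then average.
    ds = image[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    r0 = (ds.shape[0] - out_size) // 2
    c0 = (ds.shape[1] - out_size) // 2
    return ds[r0:r0 + out_size, c0:c0 + out_size]

small = preprocess(np.arange(80 * 80, dtype=float).reshape(80, 80))  # -> 33x33 array
```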
2.2 Deep neural network model
To construct our predictive network, we use a convolutional neural network whose input is an image and whose output is the predicted firing rates of every neuron in a given recording session. As shown in Fig. 2, the network consists of a series of linear-nonlinear layers. The first layer(s) performs local feature extraction on the image by sweeping banks of convolutional filters over the image, and then applying a maximum pooling operation. These local features are then globally combined at the all-to-all layer(s) to generate the predicted firing rate for every neuron in that data session.
The number of each type of layer (convolutional with maximum pooling, or all-to-all) and the details of each layer (number of units, convolution stride, etc.) are optimized to maximize the accuracy of the neural activity predictions on the 37 neurons recorded in the second data session. We do this using a combination of manual and automated searches, where the results of our manual search inform the range of the hyper-parameter space for an automated random search Bergstra2012. Using the optimal parameters (Table 1), we train and evaluate our network on the remaining 9 datasets.
We train our network using a cross-validation procedure in which we randomly subdivide a given dataset into a training subset (80% of the images and corresponding V1 activity patterns) and an evaluation subset (20% of the images). We then train all layers of our network using the TensorFlow Python package with the gradient-descent optimizer. We attribute a loss

$L_j = \sum_i \left( r_{ij} - \hat{r}_{ij} \right)^2$

to each neuron (indexed by $j$), where $i$ is the image index, $r_{ij}$ is the measured response, and $\hat{r}_{ij}$ is the network's predicted response. The neurons' losses are summed, yielding the total loss $L = \sum_j L_j$ used by the optimizer. To ensure that performance generalizes, the training data is further subdivided into data used by the optimizer to train the weights and a small subset (14% of the images) used to stop training when accuracy stops improving (early stopping).
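A minimal sketch of this training loop, assuming a squared-error per-neuron loss summed into a total loss, with early stopping on a held-out 14% split. A linear read-out stands in for the convolutional network, and the learning rate and function names are illustrative, not the paper's:

```python
import numpy as np

def total_loss(R, R_hat):
    """Total loss: sum over neurons j of L_j = sum_i (r_ij - rhat_ij)^2."""
    return float(np.sum((R - R_hat) ** 2))

def train(X, R, lr=0.05, patience=5, max_epochs=500, val_frac=0.14, seed=0):
    """Gradient descent with early stopping on a linear stand-in model.

    X: (n_images, n_features) stimuli; R: (n_images, n_neurons) mean rates.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = max(1, int(val_frac * len(X)))
    val, tr = idx[:n_val], idx[n_val:]
    W = np.zeros((X.shape[1], R.shape[1]))
    best_loss, best_W, wait = np.inf, W.copy(), 0
    for _ in range(max_epochs):
        grad = 2 * X[tr].T @ (X[tr] @ W - R[tr]) / len(tr)  # gradient of summed loss
        W -= lr * grad
        v = total_loss(R[val], X[val] @ W)
        if v < best_loss - 1e-12:
            best_loss, best_W, wait = v, W.copy(), 0
        else:
            wait += 1
            if wait >= patience:  # early stopping: validation loss stopped improving
                break
    return best_W
```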
To quantify the performance of the predictor, we compare the network's predicted firing rates to the neurons' measured firing rates on a held-out evaluation set. This set is used neither to determine the hyper-parameters nor to train the weights of the network. For each neuron, we calculate the Pearson correlation coefficient

$CC_j = \frac{\sum_i \left( r_{ij} - \bar{r}_j \right) \left( \hat{r}_{ij} - \bar{\hat{r}}_j \right)}{\sqrt{\sum_i \left( r_{ij} - \bar{r}_j \right)^2 \, \sum_i \left( \hat{r}_{ij} - \bar{\hat{r}}_j \right)^2}}$

between the predicted and measured firing rates, where the bars denote means over the evaluation images. To enable comparison with other work, we also calculate how much of the variance in the neurons' mean firing rates (over all test images) is explained by the network's predictions: the fraction of variance explained (FVE).
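Both measures can be sketched in a few lines (a minimal numpy illustration; the function names are our own):

```python
import numpy as np

def pearson_cc(pred, meas):
    """Pearson correlation between predicted and measured mean firing rates."""
    p = pred - np.mean(pred)
    m = meas - np.mean(meas)
    return float(np.sum(p * m) / np.sqrt(np.sum(p ** 2) * np.sum(m ** 2)))

def fve(pred, meas):
    """Fraction of variance explained: 1 - SSE / variance of measured rates."""
    pred, meas = np.asarray(pred), np.asarray(meas)
    sse = np.sum((meas - pred) ** 2)
    var = np.sum((meas - np.mean(meas)) ** 2)
    return float(1.0 - sse / var)
```

Note that a prediction off by a constant scale can still have CC = 1 while its FVE drops, which is why both measures are reported.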
| Hyper-parameter | Value | Hyper-parameter | Value |
|---|---|---|---|
| Conv. layer(s) | 2 | Dropout keep rate | 0.55 |

| | Conv. 1 | Conv. 2 | All-to-all |
|---|---|---|---|
| Num. filters / elem. | 16 | 32 | 300 |
| Conv. kernel | 7 | 7 | |
| Maxpool stride | 2 | 2 | |
| Maxpool kernel | 3 | 3 | |
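The layer sizes in Table 1 can be checked with a short shape calculation. We assume "valid" (unpadded) convolutions with stride 1; the padding convention is our assumption, not stated in the text:

```python
def conv_out(n, kernel, stride=1):
    """Output width of a 'valid' (unpadded) convolution."""
    return (n - kernel) // stride + 1

def pool_out(n, kernel, stride):
    """Output width of a max-pooling stage."""
    return (n - kernel) // stride + 1

n = 33                      # 33x33 input image (Sec. 2.1)
for _ in range(2):          # two conv + maxpool stages (Table 1)
    n = conv_out(n, 7)      # 7x7 convolution kernel
    n = pool_out(n, 3, 2)   # 3x3 maxpool kernel, stride 2
flat_features = n * n * 32  # 32 filters in Conv. 2, flattened for the all-to-all layer
```

Under these assumptions, the two stages shrink the 33×33 input to 3×3 maps, giving 288 flattened features feeding the 300-element all-to-all layer.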
2.3 Benchmarking the performance measures
Because of the trial-to-trial variability in neural activity, no predictor can achieve a perfect correlation of 1. To understand how well our network can predict the V1 neurons' firing rates, we compare its performance to a theoretical maximum set by the variability in the neural responses. To compute this maximum, we generate surrogate data by drawing random numbers from Gaussian distributions with the same statistics as the measured neural data. For each neuron and image, we average 20 such values to obtain a simulated prediction. We then compute the correlation between these simulated predictions and the neurons' actual mean firing rates to find the maximum correlation possible given the variability.
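This benchmark can be sketched for a single neuron as follows (a minimal numpy illustration; the function name and array layout are our own):

```python
import numpy as np

def max_correlation(mu, sigma, n_trials=20, seed=0):
    """Estimate the best achievable correlation given trial-to-trial variability.

    mu, sigma: per-image mean and standard deviation of one neuron's measured
    rates. Surrogate 'predictions' are the average of n_trials Gaussian draws
    matching those statistics; their correlation with mu bounds any predictor.
    """
    rng = np.random.default_rng(seed)
    sims = rng.normal(mu, sigma, size=(n_trials, len(mu))).mean(axis=0)
    s = sims - sims.mean()
    m = mu - mu.mean()
    return float(np.sum(s * m) / np.sqrt(np.sum(s ** 2) * np.sum(m ** 2)))
```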
2.4 Characterizing the selectivity of cells
To interpret the breadth of our results, we group the cells into functional classes, and look at how well the firing rates from each class can be predicted by our neural network model. We classify cells by how selective they are to specific natural images, and by their selectivity to specific orientations of grating stimuli.
The selectivity of each neuron to specific natural images is quantified by the image selectivity index

$S = \frac{1 - \left( \sum_i r_i / n \right)^2 / \left( \sum_i r_i^2 / n \right)}{1 - 1/n},$

where $r_i$ is the cell's firing rate indexed over the set of $n$ images zylberberg2013. This index has a value of 0 for neurons that fire equally to all images, and a value of 1 for cells that spike in response to only one of the images.
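The index can be computed as below; this assumes the Treves-Rolls-style sparseness form implied by the 0-to-1 endpoints described in the text, and the exact normalization used in zylberberg2013 may differ:

```python
import numpy as np

def image_selectivity(rates):
    """Selectivity index: 0 if all images evoke equal rates, 1 if only one does."""
    r = np.asarray(rates, dtype=float)
    n = r.size
    s = 1.0 - (r.mean() ** 2) / np.mean(r ** 2)  # 0 for uniform, 1 - 1/n for one-hot
    return s / (1.0 - 1.0 / n)                   # normalize to the [0, 1] range
```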
The orientation selectivity is measured by the cell's circular variance

$CV = 1 - \left| \frac{\sum_k r_k \, e^{2 i \theta_k}}{\sum_k r_k} \right|,$

where $r_k$ is the neuron's firing rate in response to a grating oriented at angle $\theta_k$. The circular variance is less sensitive to noise than the more commonly used orientation selectivity index mazurek2014. Following the results of Mazurek et al. mazurek2014, we use a low circular-variance threshold to define orientation-selective cells (the simple and complex cells of the Hubel and Wiesel convention) and a high threshold for non-orientation-selective cells. We omit all other cells from these two groupings.
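A standard implementation of the circular variance, with the orientation angle doubled to account for the 180° period of gratings (a sketch following mazurek2014; the function name is our own):

```python
import numpy as np

def circular_variance(rates, thetas_deg):
    """CV = 1 - |sum_k r_k exp(2i*theta_k)| / sum_k r_k, thetas in degrees."""
    th = np.deg2rad(np.asarray(thetas_deg, dtype=float))
    r = np.asarray(rates, dtype=float)
    return float(1.0 - np.abs(np.sum(r * np.exp(2j * th))) / np.sum(r))
```

A cell firing at only one orientation gives CV = 0 (strongly selective); a cell firing equally at all orientations gives CV = 1 (non-selective).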
2.5 Identifying visual features that cause the neurons to spike
To use our model as a tool to investigate the functions of individual neurons, we use DeepDream-like Mahendran2015 techniques to identify the visual features that cause each cell to spike. We invert the network by finding input images that cause a given cell to spike at a pre-specified level. To do this, we take the fully trained network and set a Gaussian white noise image as its input. We then use backpropagation to modify the pixel values of the input image, pushing the chosen neuron's predicted firing rate towards the pre-specified level. Thus, we find an input image that induces the pre-specified response. We applied this procedure to several different neurons, at several different target firing rates.
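A minimal sketch of this inversion on a toy differentiable model; the real procedure backpropagates through the trained convolutional network, while the one-layer softplus model, its weights, and the step size here are hypothetical stand-ins:

```python
import numpy as np

def invert(predict, grad_wrt_input, target, shape, lr=0.05, steps=2000, seed=0):
    """Find an input whose predicted rate reaches `target`.

    Starts from a Gaussian white noise image and follows the gradient of the
    squared error (prediction - target)^2 back to the pixels, the analogue of
    backpropagation-to-the-input on the trained network."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)
    for _ in range(steps):
        err = predict(x) - target
        x = x - lr * 2.0 * err * grad_wrt_input(x)  # chain rule through the loss
    return x

# Toy stand-in model (hypothetical): rate = softplus(w . x).
w = np.array([0.5, -1.0, 2.0])
predict = lambda x: np.log1p(np.exp(w @ x))
grad = lambda x: (1.0 / (1.0 + np.exp(-(w @ x)))) * w  # d(rate) / d(input)
x_star = invert(predict, grad, target=3.0, shape=(3,))
```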
With our optimal network (Table 1), the predicted firing rates are highly correlated with the measured firing rates for most neurons (Fig. 3A) when evaluated on held-out data. Averaged over all 355 neurons in the evaluation set (Fig. 3B), our predictor achieves 66% of the maximum correlation possible given the noise in the neural responses. Moreover, for many neurons the predictability approaches this theoretical maximum: for 54 of the 355 cells, it was within 10% of that limit. To show that our predictor performs complex, nonlinear processing on the images, we compare our results to a linear model that predicts firing rates from the pixel values of the input images. The linear model's predictions are substantially less correlated with the measured firing rates than those of the network model (Fig. 3B).
Because simple and complex cells have been extensively studied, we compare their predictability to that of the other neurons in the dataset. Grouping the cells into orientation-selective (simple- and complex-like) cells and non-orientation-selective cells (see Methods), we find that our network predicts the responses of both groups (Fig. 3B), performing slightly better on the simple- and complex-like cells than on the non-orientation-selective cells.
Given that some neurons' firing rates are well predicted by the network while others are not, we ask what distinguishes predictable from unpredictable cells. To answer this question, we quantify the cells' orientation selectivity and their image selectivity (see Methods). Comparing the predictability of each cell's firing rates with its image selectivity index (Fig. 3D) and circular variance (Fig. 3C), we find that predictability depends only weakly on these characteristics: regardless of their values, some neurons' firing rates are well predicted while others are not. Thus, orientation selectivity and image selectivity are only minor factors in determining how well our model performs.
3.1 Identifying visual features that cause the neurons to spike
By construction, our convolutional neural network assumes nothing about the neurons' receptive fields: they are learned exclusively from the training data. Therefore, we can use our network to determine, in an unbiased manner, the response properties of both well-characterized neurons (simple and complex cells) and poorly understood neurons. To do this, we invert the network and identify visual features that evoke specified responses in several of the well-predicted neurons. Based on the measured firing rate distribution, we repeat this procedure for different target firing rates, from low (the 20th percentile of the neuron's firing rate distribution) to high (the 80th percentile).
As seen in Fig. 4, this method allows us to classify several different types of cells and to identify novel response features. Cells A and B appear to function like previously characterized cells: cell A responds to a center-surround image feature, and cell B's receptive field is a Gabor wavelet. In contrast, cells C and D respond to more abstract image features that are not well represented by simple localized image masks.
We train a deep convolutional neural network to predict the firing rates of macaque V1 neurons in response to natural image stimuli. We find that the firing rates of both orientation-selective and non-orientation-selective neurons can be predicted with high accuracy. Moreover, the network can identify the image features that cause the neurons to spike. This procedure reveals both canonical localized receptive fields (such as Gabor wavelets and center-surrounds) and abstract image features, previously uncharacterized in V1 receptive fields, that are not localized to a single region of the image. Our results have implications both for developing new computer vision algorithms and for studying the visual centers of the brain.
4.1 Comparisons to other work
Studying visual processing in V1, we find that the optimal architecture of our convolutional neural network is relatively shallow compared to recent results by Yamins et al. yamins2014. Deep neural networks require large training datasets to generalize srivastava2014. With only 270 images to train and evaluate our network, the optimal architecture we find is likely a compromise between the architecture that would best represent V1 encoding given unlimited data and one that generalizes well from little data. Probing whether a deeper network could more faithfully represent V1 would require a much larger dataset; this highlights the persistent limitation that small datasets impose on deciphering the neural code.
Although it is difficult to fairly compare published results, owing to a variety of factors, our network predicts neural activity with performance comparable to the state of the art. Over all neurons, our network's predictions are strongly correlated with the actual neural firing rates. For comparison, Lau et al. lau2002 reported separate predictability values for simple and complex cells, Prenger et al. prenger2004 reported predictability averaged over all cells, and Lehky et al. lehky1992 and Willmore et al. Willmore2010 reported comparable measures. However, some contextual factors confound direct comparison with these results. Specifically, Lehky et al. selected neurons that are easier to predict, by choosing neurons that responded strongly to the presentation of bars of light, and Willmore et al. adjusted their images to match the receptive field of each neuron they predicted. Despite these methodological differences, our performance is comparable to or better than recently published results.
4.2 Implications for machine learning
While supervised learning methods achieve impressive performance on image categorization tasks krizhevsky2012; lecun2015, the trained networks are easily fooled by imperceptible image manipulations nguyen2015, and they require large amounts of training data to reach high levels of performance. Primates, however, are not so easily fooled, and they can learn to perform classification tasks from only small numbers of training examples. Thus, it may be possible to improve the deep networks used for computer vision by building on the primate brain's representations: deep neural networks like the one presented here could be pre-trained to predict primate V1 firing patterns, and subsequently trained to perform object recognition tasks. Combining our approach with traditional supervised learning could therefore lead to more robust and data-efficient algorithms.
4.3 Implications for neuroscience and medicine
By inverting our network (Fig. 4), we show that we can use the network as a tool to investigate the neurons’ response properties. As demonstrated by distinguishing Gabor wavelet (cell B) from center-surround (cell A) receptive fields, this tool can identify and classify functional cell types. Going forward, this tool shows promise for characterizing the response properties of more cells in V1, and precisely defining functional cell types that were previously overlooked. Looking beyond V1, these methods could be applied to understanding higher level cortical processing, such as visual encoding in V2. By finding the features that elicit a response in V2 neurons, this tool could help fill the visual encoding knowledge gap Ziemba2016 that exists between the abstract encoding of IT and V4 and the low-level encoding of the retina and V1.
Additionally, our results have taken a key step towards cracking the neural code for how visual stimuli are translated into neural activity in V1. This would be a major step forward in sensory neuroscience, and would enable new technologies that could restore sight to the blind. For example, cameras could continuously feed images into networks that would determine the precise V1 activity patterns that correspond to those images: a camera to brain translator. Brain stimulation methods like optogenetics ozbay2015 could then be used to generate those same activity patterns within the brain, thereby restoring sight.
To master neural encoding, we propose closing the loop between conducting the experiments and performing analysis. By using the network to generate visual stimuli hypothesized to evoke particular patterns of neural activity, experiments could directly probe the neural code, and in doing so pave the way for a new class of neuroscience experiments.
We thank Adam Kohn and Ruben Coen-Cagli for providing the experimental data, and we thank Gidon Felson, John Thompson, Adam Kohn and Ruben Coen-Cagli for providing invaluable feedback. This research is supported by the National Institutes of Health under award numbers T15 LM009451 and T32 GM008497, the Canadian Institute for Advanced Research (CIFAR) Azrieli Global Scholar Award, and the Google Faculty Research Award.
- (1) D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat’s striate cortex. The Journal of Physiology, 148(3):574–91, 1959.
- (2) S. R. Lehky, T. J. Sejnowski, and R. Desimone. Predicting responses of nonlinear neurons in monkey striate cortex to complex patterns. Journal of Neuroscience, 12(9), 1992.
- (3) Stephen V. David, William E. Vinje, and Jack L. Gallant. Natural Stimulus Statistics Alter the Receptive Field Structure of V1 Neurons. Journal of Neuroscience, 24(31), 2004.
- (4) Jorrit S. Montijn, Guido T. Meijer, Carien S. Lansink, and Cyriel M.A. Pennartz. Population-Level Neural Codes Are Robust to Single-Neuron Variability from a Multidimensional Coding Perspective. Cell Reports, 16(9):2486–2498, 2016.
- (5) Bruno A. Olshausen and David J. Field. How Close Are We to Understanding V1? Neural Computation, 17(8):1665–1699, 2005.
- (6) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances In Neural Information Processing Systems, pages 1–9, 2012.
- (7) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- (8) Daniel L. K. Yamins and James J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356–365, 2016.
- (9) Lane McIntosh, Niru Maheswaranathan, Aran Nayebi, Surya Ganguli, and Stephen Baccus. Deep Learning Models of the Retinal Response to Natural Scenes. In Advances in Neural Information Processing Systems 29, pages 1369–1377. 2016.
- (10) Daniel L. K. Yamins, Ha Hong, Charles F Cadieu, Ethan A. Solomon, Darren Seibert, and James J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 111(23):8619–24, 2014.
- (11) Ruben Coen-Cagli, Adam Kohn, and Odelia Schwartz. Flexible gating of contextual influences in natural vision. Nature Neuroscience, 18(11):1648–1655, 2015.
- (12) James Bergstra and Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13:281–305, 2012.
- (13) Joel Zylberberg and Michael Robert DeWeese. Sparse coding models can exhibit decreasing sparseness while learning sparse codes for natural images. PLOS Computational Biology, 9(8):1–10, 2013.
- (14) Mark Mazurek, Marisa Kager, and Stephen D. Van Hooser. Robust quantification of orientation selectivity and direction selectivity. Frontiers in Neural Circuits, 8:92, 2014.
- (15) Aravindh Mahendran and Andrea Vedaldi. Understanding Deep Image Representations by Inverting Them. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- (16) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
- (17) Brian Lau, Garrett B. Stanley, and Yang Dan. Computational subunits of visual cortical neurons revealed by artificial neural networks. Proceedings of the National Academy of Sciences of the United States of America, 99(13):8974–9, 2002.
- (18) Ryan Prenger, Michael C.-K. Wu, Stephen V. David, and Jack L. Gallant. Nonlinear V1 responses to natural scenes revealed by neural network analysis. Neural Networks, 17(5):663–679, 2004.
- (19) Ben D. B. Willmore, Ryan J. Prenger, and Jack L. Gallant. Neural Representation of Natural Images in Visual Area V2. Journal of Neuroscience, 30(6):2102–2114, 2010.
- (20) Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
- (21) Corey M. Ziemba, Jeremy Freeman, J. Anthony Movshon, and Eero P. Simoncelli. Selectivity and tolerance for visual texture in macaque V2. Proceedings of the National Academy of Sciences, 113(22):E3140–E3149, 2016.
- (22) Baris N. Ozbay, Justin T. Losacco, Robert Cormack, Richard Weir, Victor M. Bright, Juliet T. Gopinath, Diego Restrepo, and Emily A. Gibson. Miniaturized fiber-coupled confocal fluorescence microscope with an electrowetting variable focus lens using no moving parts. Optics Letters, 40(11):2553, 2015.