A great mystery in neuroscience is understanding how the brain effortlessly performs high-level visual tasks such as object recognition and scene understanding. Answering this question is of great scientific importance, not only for neuroscience but also for developing better computer vision algorithms. One common method for investigating vision is to use functional magnetic resonance imaging (fMRI) to measure brain activity from human subjects passively viewing natural images. Many experiments contrast brain activity elicited by specific image categories. This functional localizer approach has been used to identify many regions of interest (ROIs) in the visual pathway that appear to represent high-level semantic information. Some ROIs appear to represent the presence of animate features such as body parts and faces: the extrastriate body area (EBA), the occipital face area (OFA), and the fusiform face area (FFA). Others appear to represent information contained in natural scenes: the occipital place area (OPA), the parahippocampal place area (PPA), and the retrosplenial cortex (RSC).
Recently, a more powerful approach for investigating visual representation in the human brain has been developed, based on the idea of formulating computational encoding models. Encoding models aim to create a nonlinear mapping between the stimulus and measured brain activity. The encoding model approach is much more sensitive than the conventional functional localizer approach because a single experiment can probe an arbitrary number of categories. Furthermore, the fit encoding models provide quantitative predictions of brain activity for new stimuli that were not used to fit the models.
One previous study used the encoding model approach to investigate semantic representation in higher visual areas of the human brain. This study used a linearizing feature space to mediate between the visual stimuli and measured brain activity. The features were obtained by annotating images by hand, using binary vectors that indicated the presence or absence of specific semantic categories. These feature vectors were then regressed onto brain activity using regularized linear regression.
One drawback of the approach used in the earlier study is that each image was annotated by hand. This hand annotation process inevitably introduces subjective selection and interpretation biases into the labels, and these biases propagate into the fit encoding models. For example, an image of a rose can be labeled as a rose, flower, plant, or shrub, depending on the intuitions of the labeler. Another significant drawback of hand annotation is that it is slow. The speed of hand annotation constrains the number of labeled images that can be used as input to the encoding model, which in turn limits the space of encoding models that can be explored, and so will likely produce a suboptimal model of the brain.
Although the previous semantic encoding model provided good predictions of human brain activity in higher visual areas, the requirement of hand annotation is unsatisfying. A fully satisfying model of human vision should predict activity across the entire visual hierarchy directly from pixels, without the need for any human intervention. This requires a means for automatically encoding the semantic content of a scene directly from pixels. Recent breakthroughs in computer vision and machine learning have provided algorithms that can accurately perform many high-level visual tasks, such as object recognition and scene classification, directly from image pixels [5, 16]. Motivated by these results, here we use computer vision and machine learning algorithms to create candidate feature spaces that are used in turn to model brain activity. We focus on two state-of-the-art computer vision algorithms. The first is based on Fisher Vector (FV) encoding of local image descriptors [3, 17]. The second is based on a hierarchical image representation learned by a convolutional neural network (ConvNet) [4, 5]. We use feature spaces discovered by both of these approaches to model brain activity in single voxels distributed across visual cortex. We also compare the performance of these models to previously published encoding models based on semantic annotations provided by humans.
We show that both the FV and ConvNet models predict human brain activity accurately in high-level visual areas, and that the performance of these models is commensurate with the earlier model based on human annotations. However, the FV and ConvNet models also predict activity accurately in early and intermediate stages of the visual pathway, which the model based on hand annotation cannot do. The utility of the FV and ConvNet models for predicting responses across the visual hierarchy suggests that they might also be useful for investigating low- and intermediate-level visual processing in the human brain. Finally, we show that the ConvNet model can be used to recover the visual receptive fields of single voxels, another unique benefit that cannot be obtained from encoding models based on semantic tags. Taken together, these results clearly demonstrate the power of combining computer vision and machine learning methods with the encoding model approach to human vision, and the value of methods that bridge computer vision, machine learning, and neuroscience.
The source data for this study were functional magnetic resonance imaging (fMRI) recordings of human brain activity (specifically, the blood-oxygenation-level-dependent (BOLD) signal), recorded continuously while two subjects passively viewed a series of static photos of color natural scenes. The original study measured brain activity elicited by 1260 images shown twice each, and another set of 126 images shown 12 times each. Activity was measured in 100,000 voxels (i.e., volumetric pixels) located in the cerebral cortex of each subject. We used FV and ConvNet to construct a separate encoding model for each voxel that mapped optimally from the pixels in each estimation image into the brain activity evoked by that image (Figure 1). We then evaluated predictions of the fit models using brain activity evoked by the validation images.
2.1 Constructing Encoding Models
An encoding model consists of a feature space that provides a linearizing transformation between the stimulus images and measured brain activity. Here we constructed three different feature spaces by projecting images in the estimation set through FV (Section 2.2), ConvNet (Section 2.3), and the 19-Cat space (Section 2.4). Next, for every voxel, we used regularized linear regression to find a set of weights that predicted voxel activity from the feature-space representations of each image. (A single regularization parameter was chosen for all voxels, using five-fold cross-validation.) The accuracy of each encoding model for each voxel was expressed as the correlation (r) between predicted and observed voxel activity, using the validation set reserved for this purpose. The explained variance in each voxel's responses was calculated as the square of the correlation coefficient. Prediction accuracy was deemed statistically significant if the correlation exceeded a permutation-based threshold (for details, see Supplementary Material).
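For concreteness, the fitting and evaluation procedure can be sketched in Python. This is a simplified illustration on simulated data, not our analysis code; the array sizes, the regularization grid, and the synthetic voxel are placeholders:

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression: w = (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def pick_lambda(X, y, lambdas, n_folds=5):
    """Choose the regularization parameter by n-fold cross-validation."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, n_folds)
    cv_err = []
    for lam in lambdas:
        errs = []
        for k in range(n_folds):
            test, train = folds[k], np.setdiff1d(idx, folds[k])
            w = fit_ridge(X[train], y[train], lam)
            errs.append(np.mean((X[test] @ w - y[test]) ** 2))
        cv_err.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_err))]

rng = np.random.default_rng(0)
n, d = 200, 20
features = rng.standard_normal((n, d))        # feature-space representation of each image
w_true = rng.standard_normal(d)
activity = features @ w_true + 0.5 * rng.standard_normal(n)  # simulated voxel activity

X_est, y_est = features[:150], activity[:150]  # estimation set
X_val, y_val = features[150:], activity[150:]  # validation set

lam = pick_lambda(X_est, y_est, [0.1, 1.0, 10.0, 100.0])
w = fit_ridge(X_est, y_est, lam)
r = np.corrcoef(X_val @ w, y_val)[0, 1]        # prediction accuracy
print(round(float(r), 3))
```

In the real analysis the feature matrix comes from FV, ConvNet, or the 19-Cat space, and the response vector is the measured BOLD activity of one voxel.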
2.2 Fisher-Vector (FV) Feature Representation
The FV encoding model used a feature space derived from high-order edge statistics of natural image patches. To learn the feature space, a dictionary of prototypical image patch features was first learned using a Gaussian mixture model (GMM) applied to the SIFT descriptors obtained from thousands of random natural image patches. SIFT features capture the distribution of edge orientation energy in each patch. The number of prototypical patches in the dictionary was 64. We chose this value to maximize prediction accuracy of voxel activity, using a portion of the estimation data reserved for this purpose. FV features for each image reflect the vector distance between SIFT features for patches sampled from a multi-scale grid of locations across that image and the prototypical features learned by the GMM (see Figure 2 for an illustration).
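As an illustration of the encoding step, a minimal first-order Fisher Vector (the gradient with respect to the GMM means only, with the power and L2 normalization of Perronnin et al. [3]) can be computed as follows. The descriptors and GMM parameters below are random stand-ins, not learned from images:

```python
import numpy as np

def fisher_vector(descriptors, means, sigmas, weights):
    """First-order Fisher Vector (gradient w.r.t. GMM means only).

    descriptors: (n, d) local descriptors (e.g., SIFT) from one image.
    means, sigmas: (k, d) parameters of a pre-fit diagonal GMM dictionary.
    weights: (k,) mixture weights.
    """
    n, d = descriptors.shape
    k = means.shape[0]
    # Posterior responsibility of each mixture component for each descriptor.
    log_p = np.zeros((n, k))
    for j in range(k):
        z = (descriptors - means[j]) / sigmas[j]
        log_p[:, j] = (np.log(weights[j])
                       - 0.5 * np.sum(z ** 2 + np.log(2 * np.pi * sigmas[j] ** 2), axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradient of the image log-likelihood w.r.t. each component mean.
    fv = np.zeros((k, d))
    for j in range(k):
        fv[j] = (gamma[:, [j]] * (descriptors - means[j]) / sigmas[j]).sum(axis=0)
        fv[j] /= n * np.sqrt(weights[j])
    fv = fv.ravel()
    # Power and L2 normalization.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

rng = np.random.default_rng(1)
desc = rng.standard_normal((500, 8))    # stand-in for SIFT descriptors
means = rng.standard_normal((4, 8))     # dictionary of 4 prototype patches
sigmas = np.ones((4, 8))
weights = np.full(4, 0.25)
v = fisher_vector(desc, means, sigmas, weights)
print(v.shape)                          # k * d dimensions
```

The actual pipeline uses a 64-component GMM over 128-dimensional SIFT descriptors sampled on a multi-scale grid, and typically also includes the second-order (variance) gradients.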
2.3 Convolutional Neural Network (ConvNet) Feature Representation
The ConvNet encoding model used a feature space derived from the various layers of a ConvNet. The feature space is learned by training the ConvNet on the task of image classification. Here we use the seven-layer ConvNet architecture proposed in [5]. The first five layers are convolutional (denoted conv-1 through conv-5), and the last two layers are fully connected (fc-6, fc-7). We trained the ConvNet on the ImageNet database, which consists of over 1 million natural images classified into 1000 distinct object categories. The ConvNet features for each image in the estimation set consisted of the feature activations in each layer of the ConvNet. This resulted in seven possible feature spaces for each image, one corresponding to each layer of the ConvNet. We selected the optimal ConvNet feature space for each voxel by maximizing prediction accuracy of voxel activity, using a portion of the estimation data reserved for this purpose (Figure 2).
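The per-voxel layer selection can be sketched as follows. Because no trained network is available in this sketch, each "layer" is simulated as a random linear projection of the images, and the simulated voxel is driven by the hypothetical fc-6 feature space; in the real analysis the feature matrices would be activations from the trained ConvNet:

```python
import numpy as np

rng = np.random.default_rng(2)
n_images, n_pix = 120, 64
images = rng.standard_normal((n_images, n_pix))      # stand-in "images"

# Stand-in layer activations: one feature matrix per layer. In the real
# analysis these come from conv-1 ... fc-7 of the trained network.
layer_names = ["conv-1", "conv-2", "conv-3", "conv-4", "conv-5", "fc-6", "fc-7"]
layers = {name: images @ rng.standard_normal((n_pix, 32)) for name in layer_names}

# Simulated voxel driven by the (hypothetical) fc-6 feature space.
w_true = rng.standard_normal(32)
voxel = layers["fc-6"] @ w_true + 0.5 * rng.standard_normal(n_images)

def heldout_r(X, y, n_train=90, lam=10.0):
    """Fit ridge on a reserved portion, score by correlation on the rest."""
    w = np.linalg.solve(X[:n_train].T @ X[:n_train] + lam * np.eye(X.shape[1]),
                        X[:n_train].T @ y[:n_train])
    return float(np.corrcoef(X[n_train:] @ w, y[n_train:])[0, 1])

scores = {name: heldout_r(X, voxel) for name, X in layers.items()}
best_layer = max(scores, key=scores.get)   # optimal feature space for this voxel
print(best_layer)
```

The same held-out portion of the estimation data is used for every layer, so the per-voxel comparison across layers is fair.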
2.4 19-Category (19-Cat) Feature Representation
The 19-Cat encoding model used a simple feature space consisting of a 19-dimensional binary vector that indicates the presence (1) or absence (0) of 19 semantic image categories (e.g., "furniture," "vehicle," "water"). These categories are likely only a small subset of all those represented in the human brain, but because brain activity measurements are signal-limited, the 19-Cat model predicts brain activity nearly as well as more complicated semantic models. We used the 19-Cat model as the benchmark for evaluating the FV and ConvNet models developed in this study.
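Constructing such binary indicator features is straightforward. The vocabulary beyond the three categories named above and the toy annotations here are illustrative only:

```python
import numpy as np

# Illustrative vocabulary (the real model uses 19 categories) and
# hypothetical per-image hand annotations.
categories = ["furniture", "vehicle", "water", "animal", "person"]
annotations = [{"vehicle", "water"}, {"person", "animal"}, {"furniture"}]

# One binary row per image, one column per category: 1 = present, 0 = absent.
X = np.array([[1.0 if c in labels else 0.0 for c in categories]
              for labels in annotations])
print(X.shape)   # the real feature matrix would be (n_images, 19)
```

These rows then enter the same regularized regression used for the FV and ConvNet feature spaces.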
3 Encoding Model Performance
In the encoding model framework, the evidence for or against any model is provided by predictions of brain activity in a separate data set reserved for this purpose. Therefore, to determine whether feature spaces derived from computer vision and machine learning are useful for explaining human vision, we compared predictions of the FV and ConvNet encoding models to the performance of the established 19-Cat model. This comparison is summarized in Figures 3 and 4.
For voxels located in lower visual areas, both the FV- and ConvNet-based models outperform the 19-Cat model. This result makes sense: the feature spaces represented by the FV and ConvNet models incorporate structural information that is absent from the 19-Cat model. Because early visual areas are known to be selective for structural information in natural images, the FV and ConvNet models can predict voxel activity in early areas but the 19-Cat model cannot. For voxels located in higher visual areas, predictions of the FV and ConvNet models are correlated with those of the 19-Cat model, but the ConvNet model outperforms the 19-Cat model in a few anterior areas (e.g., OFA and EBA). Thus, encoding models based on feature spaces derived using current computer vision or machine learning algorithms predict activity in many visual areas better than models built using hand annotations.
Figure 4 shows that the ConvNet model generally provides better predictions than the FV model. The first layer of the ConvNet learns Gabor-like features whose spatial profiles are similar to V1 receptive fields, so perhaps this difference is not surprising in area V1. However, the ConvNet model also provides relatively more accurate predictions in intermediate areas like V4 and LO. These areas are believed to be involved in form processing and object segmentation, but the features represented therein are poorly understood. Our results suggest that the intermediate layers of the ConvNet may provide a useful set of features for studying visual processing in these intermediate visual areas.
4 Investigating Voxel Tuning
The previous section demonstrates that the feature spaces provided by FV and ConvNets can be used to accurately predict brain activity in many visual areas. This suggests that we might be able to gain a better understanding of human visual representation by examining the FV and ConvNet models fit to individual voxels. To visualize the features represented in a single voxel, we first used the weights of the fit ConvNet model to generate theoretical responses to a large collection of natural images. (We restricted our analysis to ConvNet models here because they generally provide the most accurate predictions.) We then rank-ordered the images according to the responses predicted by the fit model. The top and bottom images within this ranking provide qualitative intuition about the features that are represented by a particular voxel.
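The ranking step itself is simple; in this sketch the "ConvNet features" and the fit voxel weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)
n_images, n_feat = 1000, 50
features = rng.standard_normal((n_images, n_feat))  # ConvNet features of an image collection
w_voxel = rng.standard_normal(n_feat)               # fit encoding-model weights for one voxel

predicted = features @ w_voxel                      # theoretical response to every image
order = np.argsort(predicted)[::-1]                 # rank images by predicted response
top10, bottom10 = order[:10], order[-10:]           # images to inspect for tuning
print(len(top10), len(bottom10))
```

Inspecting the images at the two ends of this ranking is what yields the qualitative tuning descriptions reported below.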
Figure 5 shows results for several voxels sampled from V1, V4, EBA, and PPA. Activity in the V1 voxel is predicted to increase when an image consists of high-frequency texture and to decrease when it consists of blue low-frequency texture. This pattern of selectivity is reminiscent of the contrast sensitivity reported previously in neurophysiological studies of V1. Activity of the EBA voxel is predicted to increase when an image contains people or animals and to decrease when it contains scenes or texture, while the PPA voxel is predicted to increase activity when images contain scenes and to decrease activity when they contain texture. These patterns of selectivity are consistent with previous reports of these areas [7, 9, 13]. The V4 voxel is of particular interest because the visual features represented in area V4 are largely unknown. Activity of the V4 voxel is predicted to increase when the center of the image contains an orange blob and to decrease when it contains large-scale texture. This is at least qualitatively consistent with neurophysiological reports that area V4 is selective for curvature and radial patterns.
The encoding model approach also provides new opportunities for investigating the fine-grained structure of classical ROIs identified in earlier studies. As a demonstration of this, we performed K-Means clustering of the ConvNet model weights for all of the voxels within area EBA whose activity was predicted significantly. The results revealed two stable clusters of voxels within EBA. One cluster (C1) is predicted to increase activity when images contain full bodies in motion and to decrease when they contain round, up-close objects. The second cluster (C2) is predicted to increase activity when images contain humans and to decrease when they depict outdoor scenes (Figure 6). Projection of the functional clusters onto cortical flat maps suggests that the clusters are also spatially segregated (Figure 7). Thus, this result suggests that EBA contains two distinct functional subdivisions. If this is true, then the average ConvNet model fit to all voxels in C1 should predict the activity of C1 voxels significantly better than the average ConvNet model fit to voxels in C2, and vice versa. As shown in Table 1, this prediction is confirmed (see Supplementary Material for details).
|                          | Subject-1 (S1) | Subject-2 (S2) |
| Explained Variance in C1 |                |                |
| Explained Variance in C2 |                |                |
5 Conclusions and Future Directions
In this work we sought to leverage recent advances in computer vision and machine learning to develop encoding models that accurately predict human brain activity evoked by complex natural images. Previous encoding models based on hand annotations of natural images produced good predictions, but the hand annotation they require is unsatisfying. We investigated two models, one based on FV and one based on ConvNets. We find that these models predict brain activity across many low- and high-level visual areas with an accuracy commensurate with previous models. This is a remarkable result, because these predictions were based entirely on features learned by the FV and ConvNet algorithms and did not require any human annotation. The fact that the FV and ConvNet models explain brain activity across visual cortex suggests that the human brain is exquisitely tuned to natural scene statistics.
The ConvNet encoding model provides a powerful new way to investigate visual representation in the human brain. The models fit to individual voxels can be probed in order to visualize the patterns predicted to increase or decrease brain activity. This exercise confirms previous findings and leads to insights about representation in intermediate visual areas. The models can also be used to explore conventional ROIs in more detail. For example, we find that area EBA consists of two functionally and spatially segregated subdivisions. Together, these results demonstrate the power of combining modern methods of computer vision and machine learning with the encoding model approach to fMRI.
This work was supported by grants to Jack L. Gallant from the National Eye Institute (EY019684), the National Institute of Mental Health (MH66990), and the National Science Foundation Center for the Science of Information (CCF-0939370). Pulkit Agrawal was supported by a Fulbright Science and Technology Award. We thank NVIDIA Corporation for providing us with GPUs. We thank Alexander G. Huth, Mark Lescroart, Anwar Nunez-Elizalde, and Brian Cheung for their helpful discussions and comments.
-  A G Huth, S Nishimoto, A T Vu, and J L Gallant, “A continuous semantic space describes the representation of thousands of object and action categories across the human brain.,” Neuron, vol. 76, no. 6, pp. 1210–24, Dec. 2012.
-  D Stansbury, T Naselaris, and J Gallant, “Natural scene statistics account for the representation of scene categories in human visual cortex.,” Neuron, vol. 79, no. 5, pp. 1025–34, Sept. 2013.
-  F Perronnin, S Jorge, and T Mensink, “Improving the Fisher Kernel for Large-Scale Image Classification,” European Conference on Computer Vision, pp. 143–156, 2010.
-  Y LeCun, B Boser, J S Denker, D Henderson, R E Howard, W Hubbard, and L D Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, 1989.
-  A Krizhevsky, I Sutskever, and G E Hinton, “Imagenet classification with deep convolutional neural networks.,” in NIPS, 2012.
-  Richard B. Buxton, Introduction to Functional Magnetic Resonance Imaging: Principles and Techniques, Cambridge University Press, 2002.
-  K.J. Friston, A. Holmes, J.B. Poline, C.J. Price, and C.D. Frith, “Detecting activations in PET and fMRI: levels of inference and power,” NeuroImage, vol. 4, no. 3, pp. 223–235, 1996.
-  M Spiridon, B Fischl, and N Kanwisher, “Location and spatial profile of category-specific regions in human extrastriate cortex,” Human Brain Mapping, vol. 27, no. 1, pp. 77–89, 2006.
-  I Gauthier, M J Tarr, J Moylan, P Skudlarski, J C Gore, and A W Anderson, “The fusiform ‘face area’ is part of a network that processes faces at the individual level,” Journal of Cognitive Neuroscience, vol. 12, no. 3, pp. 495–504, 2000.
-  I Gauthier, M J Tarr, J Moylan, P Skudlarski, J C Gore, and A W Anderson, “The fusiform ‘face area’ is part of a network that processes faces at the individual level,” Journal of Cognitive Neuroscience, vol. 12, no. 3, pp. 495–504, 2000.
-  N Kanwisher, J McDermott, and M M Chun, “The fusiform face area: a module in human extrastriate cortex specialized for face perception,” The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, vol. 17, no. 11, pp. 4302–4311, 1997.
-  K. Nakamura, R. Kawashima, N. Sato, A. Nakamura, M. Sugiura, T. Kato, K. Hatano, K. Ito, H. Fukuda, T. Schormann, and K. Zilles, “Functional delineation of the human occipito-temporal areas related to face and scene processing,” Brain, vol. 123, no. 9, pp. 1903 –1912, 2000.
-  R Epstein and N Kanwisher, “A cortical representation of the local visual environment,” Nature, vol. 392, no. 6676, pp. 598–601, 1998.
-  E A Maguire, “The retrosplenial contribution to human navigation: a review of lesion and neuroimaging findings,” Scandinavian Journal of Psychology, vol. 42, no. 3, pp. 225–238, 2001.
-  T Naselaris, R J Prenger, K N Kay, M Oliver, and J L Gallant, “Bayesian reconstruction of natural images from human brain activity.,” Neuron, vol. 63, no. 6, pp. 902–15, Sept. 2009.
-  R Girshick, J Donahue, T Darrell, and J Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” arXiv preprint arXiv:1311.2524, 2013.
-  J Sánchez, F Perronnin, T Mensink, and J Verbeek, “Image Classification with the Fisher Vector: Theory and Practice,” International Journal of Computer Vision, vol. 105, no. 3, pp. 222–245, June 2013.
-  T Naselaris, K N Kay, S Nishimoto, and J L Gallant, “Encoding and decoding in fMRI.,” NeuroImage, vol. 56, no. 2, pp. 400–10, May 2011.
-  S V David and J L Gallant, “Predicting neuronal responses during natural vision,” Network, vol. 16, no. 2-3, pp. 239–260, 2005.
-  D G Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.
-  “Caffe: An open source convolutional architecture for fast feature embedding,” http://caffe.berkeleyvision.org/, 2013.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
-  D H Hubel and T N Wiesel, “Receptive fields and functional architecture of monkey striate cortex,” The Journal of Physiology, vol. 195, no. 1, pp. 215–243, 1968.
-  C Enroth-Cugell and J G Robson, “The contrast sensitivity of retinal ganglion cells of the cat,” The Journal of Physiology, vol. 187, no. 3, pp. 517–552, 1966.
-  J L Gallant, J Braun, and D C Van Essen, “Selectivity for polar, hyperbolic, and cartesian gratings in macaque visual cortex,” Science, vol. 259, no. 5091, pp. 100–103, 1993.
-  T Naselaris, K N Kay, S Nishimoto, and J L Gallant, “Encoding and decoding in fMRI,” NeuroImage, vol. 56, no. 2, pp. 400–410, 2011.
-  A Oliva and A Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
-  Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492, June 2010.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes (VOC) Challenge,” IJCV, vol. 88, 2010.
1 Testing Overlap in fMRI Stimulus Images and ImageNet Images
Recall that we trained the ConvNet with a learning database of images and labels taken from the ImageNet ILSVRC12 data set. Given that we used these features to construct encoding models of fMRI activity, it is important to ensure that the stimuli used in the fMRI experiment do not significantly overlap with the images used to train the ConvNet, as any such overlap could make the predictions trivially good.
We determined the amount of overlap between the stimulus images and ImageNet by closely following a previously proposed method that compares the distance between the GIST descriptors of images after resizing them to a common size. We found 5 common images out of a total of 1386 images (1260 training + 126 validation), which is much less than 0.5% of all the images used for the fMRI experiment. This suggests that there is no substantial overlap between the stimuli and the images used to train the ConvNet.
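The duplicate check can be sketched as below. For brevity we use a crude block-averaging descriptor in place of true GIST features, and a hypothetical distance threshold; both are placeholders for the actual method:

```python
import numpy as np

def tiny_descriptor(img, size=8):
    """Crude stand-in for a GIST descriptor: block-average the image to size x size.
    (The real check uses actual GIST features.)"""
    h, w = img.shape
    bh, bw = h // size, w // size
    return img[:bh * size, :bw * size].reshape(size, bh, size, bw).mean(axis=(1, 3)).ravel()

def count_overlap(set_a, set_b, thresh=1e-3):
    """Count images in set_a whose nearest descriptor in set_b is closer than thresh."""
    da = np.array([tiny_descriptor(im) for im in set_a])
    db = np.array([tiny_descriptor(im) for im in set_b])
    dists = np.sqrt(((da[:, None, :] - db[None, :, :]) ** 2).sum(-1))
    return int((dists.min(axis=1) < thresh).sum())

rng = np.random.default_rng(4)
stimuli = [rng.random((32, 32)) for _ in range(20)]    # stand-in fMRI stimuli
training = [rng.random((32, 32)) for _ in range(20)]   # stand-in ImageNet images
training[5] = stimuli[3].copy()                        # plant one duplicate
print(count_overlap(stimuli, training))
```

The planted duplicate is found because its descriptor distance is exactly zero, while unrelated images sit far above any reasonable threshold.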
2 Computing Significance Values
In this work we assess encoding model accuracy by calculating the correlation coefficient between the actual responses and the responses predicted by the model. When analyzing thousands of models, as is done in this work, high correlations can occur by chance. We would thus like to determine the level of correlation that is significant when compared to a null distribution obtained by chance. To determine significant correlation, we use a permutation method.
Specifically, for each encoding model, we estimate the prediction accuracy based on 1000 permutations of the validation set responses, giving a distribution of null correlation values. We then take the upper (1 − p-value)-th percentile of the null distribution as the threshold for significant correlation. The significance value (p-value) is the probability of observing the correlation between actual and predicted responses (without any shuffling) under the null distribution created by shuffled stimulus-response pairs.
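A sketch of this permutation test follows; the responses are simulated, and only the number of permutations and validation-set size match the text:

```python
import numpy as np

def permutation_threshold(pred, actual, n_perm=1000, p=0.001, seed=0):
    """Threshold from the null distribution of correlations obtained by
    shuffling the stimulus-response pairing."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = np.corrcoef(pred, rng.permutation(actual))[0, 1]
    # Upper (1 - p)-th percentile of the null distribution.
    return float(np.quantile(null, 1.0 - p))

rng = np.random.default_rng(5)
actual = rng.standard_normal(126)                      # validation-set responses
pred = 0.6 * actual + 0.8 * rng.standard_normal(126)   # a model with real signal
r = float(np.corrcoef(pred, actual)[0, 1])
thresh = permutation_threshold(pred, actual)
print(r > thresh)   # significant correlation
```

With 126 validation responses, chance correlations rarely exceed ~0.3, so a genuinely predictive model clears the threshold easily.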
3 Encoding Performance for S2
Figure 1 compares the prediction accuracy of the FV and ConvNet models with the 19-Cat model for subject S2. Figure 2 shows the explicit comparison between the FV and ConvNet models. The results are similar to those discussed in the main paper for S1.
4 ROI Clustering
In the main text we present results that show the existence of two functional sub-regions in EBA. Our method for identifying these sub-regions using clustering is as follows. First, we trained encoding models for each voxel in EBA based on all layers of the ConvNet and chose the model for each voxel that provided the most accurate predictions on the validation set. For EBA, voxel activity is generally best predicted by models based on layer fc-7 of the ConvNet. This layer defines a 4096-dimensional feature representation of the stimulus, so the encoding model weights are also 4096-dimensional vectors. We then concatenated the model weights for all EBA voxels, reduced their dimensionality from 4096 to 100 dimensions using PCA, and ran K-Means on the PC-reduced model weights.
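This pipeline can be sketched with simulated weights. The two planted voxel groups, the noise level, and the plain farthest-point/Lloyd's K-Means are illustrative stand-ins:

```python
import numpy as np

def pca_reduce(X, n_comp):
    """Project rows of X onto the top n_comp principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_comp].T

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm with farthest-point initialization."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

rng = np.random.default_rng(6)
n_vox, dim = 60, 4096
# Simulated fc-7 encoding-model weights: two underlying voxel groups.
base = rng.standard_normal((2, dim))
weights = np.vstack([base[i % 2] + 0.3 * rng.standard_normal(dim)
                     for i in range(n_vox)])

reduced = pca_reduce(weights, 100)   # 4096 -> at most 100 dimensions
labels = kmeans(reduced, 2)          # cluster assignment per voxel
print(sorted(np.bincount(labels).tolist()))
```

With only 60 voxels the SVD yields at most 60 components, so the "100-dimensional" reduction is capped by the number of samples; the real EBA analysis has many more voxels than components.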
4.1 Determining the number of clusters
The number of clusters (K) in the EBA analysis was determined using two separate methods, each giving the same result. The two methods are described here:
Elbow Method: The intuition behind this method is that one should choose a number of clusters such that adding another cluster does not result in a significant decrease of the objective function optimized by the K-Means algorithm. We refer to the value of this objective function as the energy of the clustering. We calculate the clustering energy for values of K ranging from 1 to 10. Figure 3 plots this energy for one subject. It can be seen that the clustering energy does not decrease significantly for values of K greater than 2. This indicates that there are two distinct clusters in the data.
Entropy-based method: The number of clusters is chosen based on the stability of the results obtained across 100 bootstrapped repetitions of the clustering method. The intuition is that if the clustering is stable, then across different bootstrap runs a particular voxel should consistently be assigned to the same cluster. Thus the stability of the clustering can be characterized by calculating the entropy across the repeated clustering runs. Low average entropy across all voxels indicates stable clustering across the population. We thus calculate the average entropy across the population of voxels for K = 2 to 10. The results are shown in Figure 3(b). It can be clearly seen that values of K greater than 2 result in high entropy and consequently clusters that are not stable. This suggests that there are two distinct and stable clusters in the data.
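The elbow criterion can be illustrated on synthetic data with an obvious two-cluster structure (the data, the separation, and the simple K-Means implementation are placeholders):

```python
import numpy as np

def kmeans_energy(X, k, n_iter=50, seed=0):
    """Within-cluster sum of squares ('energy') after a simple K-Means run."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):                          # farthest-point initialization
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(n_iter):                         # Lloyd iterations
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return float(((X - centers[labels]) ** 2).sum())

rng = np.random.default_rng(7)
# Synthetic "model weights" with two well-separated groups of 30 voxels each.
X = np.vstack([rng.standard_normal((30, 10)) + 8.0 * (i % 2) for i in range(2)])

energies = [kmeans_energy(X, k) for k in range(1, 6)]
drops = np.diff(energies)               # energy decrease from adding one more cluster
chosen_k = int(np.argmin(drops)) + 2    # the last K whose addition helped substantially
print(chosen_k)
```

The energy falls sharply going from K = 1 to K = 2 and only marginally afterwards, reproducing the elbow at K = 2.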
4.2 Average Cluster Model
The clustering in the PC-space results in an assignment of each voxel to a particular cluster. The average model for each cluster is the mean of model weights of all voxels assigned to that cluster.
5 Investigating Voxel Tuning
5.1 Data-Set Construction
5.2 Estimating Visual Receptive Fields
Recall that, for the purpose of comparing the ConvNet model with the FV and 19-Cat models, we selected the best layer based on performance on a held-out part of the training set. Since we are no longer interested in comparing performance but in actually estimating functional/visual receptive fields, we instead use the model from the layer of the ConvNet that yields the best predictions on the validation set of the fMRI data to compute predicted brain activity for the data set of images described in Section 5.1. Because the validation set has 12 repeats of each image instead of the 2 in the training set, it has a higher SNR and consequently gives a more accurate representation of the brain activity. This choice simply reduces noise for visualization and does not affect any of our conclusions.