Sharing deep generative representation for perceived image reconstruction from human brain activity

Decoding human brain activity via functional magnetic resonance imaging (fMRI) has gained increasing attention in recent years. While encouraging results have been reported in brain-state classification tasks, reconstructing the details of human visual experience remains difficult. Two main challenges that hinder the development of effective models are the perplexing fMRI measurement noise and the high dimensionality of limited data instances. Existing methods generally suffer from one or both of these issues and yield unsatisfactory results. In this paper, we tackle this problem by casting the reconstruction of the visual stimulus as the Bayesian inference of a missing view in a multiview latent variable model. Sharing a common latent representation, our joint generative model of external stimulus and brain response is not only "deep" in extracting nonlinear features from visual images, but also powerful in capturing correlations among voxel activities of fMRI recordings. The nonlinearity and deep structure endow our model with strong representation ability, while the correlations of voxel activities are critical for suppressing noise and improving prediction. We devise an efficient variational Bayesian method to infer the latent variables and the model parameters. To further improve the reconstruction accuracy, the latent representations of testing instances are enforced to be close to those of their neighbours from the training set via posterior regularization. Experiments on three fMRI recording datasets demonstrate that our approach can more accurately reconstruct visual stimuli.




1 Introduction

Brain decoding, which aims to predict information about external stimuli from brain activity, plays an important role in brain-machine interfaces (BMIs). Recent developments in this area have shown promising results [Schoenmakers et al.2015, Lee and Kuhl2016]. However, most previous studies focus only on predicting the category of the presented stimulus [Van Gerven et al.2010a, Ng and Abugharbieh2011, Damarla and Just2013, Elahe’Yargholi2016]. Accurate reconstruction of visual stimuli from brain activity has received far less attention and still requires considerable improvement. Two main challenges that hinder the development of effective models are the perplexing measurement noise of functional magnetic resonance imaging (fMRI) and the high dimensionality of limited data instances. Existing methods generally suffer from one or both of these issues and yield unsatisfactory results.

Fujiwara et al. proposed to use Bayesian canonical correlation analysis (BCCA) for building the reconstruction model, where image bases are automatically extracted from the measured data [Fujiwara et al.2013]. As a latent variable model interpretation of non-probabilistic CCA, BCCA assumes a linear observation model for visual images and a spherical covariance for the Gaussian distribution of voxel activities. In practice, however, a linear observation model for visual images has limited representation power, and a spherical covariance cannot capture the correlations among voxel activities. Since measurement noise is ubiquitous in voxel activities, utilizing the correlations among voxel activities is critical for suppressing noise and improving prediction performance.

On the other hand, introducing deep structure into multiview representation learning has attracted increasing attention recently [Wang et al.2015, Chandar et al.2016]. Deep canonically correlated autoencoders (DCCAE), which consist of two deep autoencoders and optimize the combination of canonical correlation between the learned bottleneck representations and the reconstruction errors of the autoencoders, can extract nonlinear features from both views and reconstruct each view from the correlated bottleneck representations [Wang et al.2015]. Nevertheless, DCCAE does not consider the cross-reconstruction between two views, which limits its effectiveness in applications where a missing view needs to be reconstructed from the existing one. To our knowledge, no deep multiview learning model with shared generative latent representation has been designed specifically for missing view reconstruction.

Focusing on these problems, we present a deep generative multiview model (DGMM), where we cast the reconstruction of the perceived image as the Bayesian inference of the missing view. Sharing a common latent representation, DGMM allows us to generate visual images and fMRI activity patterns simultaneously. For visual images, unlike BCCA, we explore nonlinear observation models parameterized by deep neural networks (DNNs), which can be multi-layered perceptrons (MLPs) or convolutional neural networks (CNNs). This nonlinearity and deep structure endow our model with strong representation ability. For fMRI activity patterns, we adopt a full covariance matrix for the Gaussian distribution of voxel activities. While the full covariance matrix has the advantage of capturing the correlations among voxels, it results in severe computational issues. To reduce the complexity, we impose a low-rank assumption on the covariance matrix. This helps to suppress noise and improve prediction performance. Furthermore, we devise an efficient mean-field variational inference method to infer the latent variables and the model parameters. To further improve the reconstruction accuracy, the latent representations of testing instances are enforced to be close to those of their neighbours from the training set via posterior regularization [Zhu et al.2014]. Compared with the non-probabilistic deep multiview representation learning models mentioned above [Wang et al.2015, Chandar et al.2016], our Bayesian model has the inherent advantage of avoiding overfitting to small training sets by model averaging. Finally, extensive experimental comparisons on three fMRI recording datasets demonstrate that our approach can reconstruct visual images from fMRI measurements more accurately.

2 Related work

In the brain decoding literature, relatively few studies have reported perceived image reconstructions to date. Miyawaki et al. reconstructed lower-order information such as binary contrast patterns using a combination of multi-scale local image bases whose shapes are predefined [Miyawaki et al.2008]. Van Gerven et al. reconstructed handwritten digits using deep belief networks [Van Gerven et al.2010b]. Schoenmakers et al. reconstructed handwritten characters using a straightforward linear Gaussian approach [Schoenmakers et al.2013]. Fujiwara et al. proposed to build the reconstruction model in which image bases can be automatically estimated by Bayesian canonical correlation analysis (BCCA) [Fujiwara et al.2013]. In addition, there are works trying to reconstruct movie clips [Nishimoto et al.2011, Haiguang Wen and Liu2016].

Though a similar strategy to our work has been used by Fujiwara et al. [Fujiwara et al.2013] for visual image reconstruction, its linear observation model for visual images has limited representation power in practice. Several recently proposed deep multiview representation learning models can also be applied to visual image reconstruction [Wang et al.2015, Chandar et al.2016]. For example, deep canonically correlated autoencoders (DCCAE), with nonlinear observation models for both views, can learn deep correlated representations and reconstruct each view from its own learned representation [Wang et al.2015]. Compared with DCCAE, correlational neural networks (CorrNet) further consider the cross-reconstruction between two views [Chandar et al.2016]. However, directly applying the nonlinear maps of DCCAE and CorrNet to limited, noisy brain activities is prone to overfitting.

Inspired by recent developments in deep generative models such as variational autoencoders (VAE) [Kingma and Welling2014], we present a deep generative multiview model (DGMM), which can be viewed as a nonlinear extension of the linear method BCCA. To the best of our knowledge, this paper is the first to study visual image reconstruction via Bayesian deep learning.

3 Perceived image reconstruction with DGMM

In this section, we cast the reconstruction of perceived images from human brain activity as the Bayesian inference of missing view in a multiview latent variable model.

Assume the training set consists of paired observations from two distinct views (), denoted by () (), where is the training set size, and for . Here and denote the visual images and fMRI activity patterns, respectively. The presence of paired two-view data presents an opportunity to learn better representations by analyzing both views simultaneously. Therefore, we introduce the shared latent variables to relate the visual images to the fMRI activity patterns . The shared latent variables are treated as the following Gaussian prior distribution,


Since the visual image and associated fMRI activity pattern are assumed to be generated from the same latent variables, we have two likelihood functions. One is for visual images, and the other is for fMRI activity patterns.
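To make the generative story concrete, the following is a minimal NumPy sketch of ancestral sampling from a two-view model with a shared latent code. All names (`sample_two_views`, `W1`, `W2`, `B`), the toy one-hidden-layer decoder, and the noise scales are illustrative assumptions, not the implementation used in this paper. For brevity the fMRI view uses isotropic noise here, whereas Section 3.2 argues for a full covariance.

```python
import numpy as np

def sample_two_views(n, k, d_img, d_vox, rng):
    """Draw n paired samples (image, voxel pattern) that share one latent code."""
    # Shared latent variables: standard-normal prior, one k-vector per instance.
    z = rng.standard_normal((n, k))

    # Image view: a toy nonlinear decoder (one tanh hidden layer) gives the mean.
    W1 = rng.standard_normal((k, 16))
    W2 = rng.standard_normal((16, d_img))
    x_mean = np.tanh(z @ W1) @ W2
    x = x_mean + 0.1 * rng.standard_normal((n, d_img))  # diagonal Gaussian noise

    # fMRI view: a linear projection of the same latent code plus Gaussian noise
    # (isotropic here for brevity; an assumption of this sketch).
    B = rng.standard_normal((k, d_vox))
    y = z @ B + 0.1 * rng.standard_normal((n, d_vox))
    return z, x, y

rng = np.random.default_rng(0)
z, x, y = sample_two_views(n=5, k=3, d_img=100, d_vox=50, rng=rng)
print(z.shape, x.shape, y.shape)
```

Because both views are generated from the same `z`, inferring `z` from one view gives a route to predicting the other, which is the idea behind treating reconstruction as missing-view inference.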

3.1 Deep generative model for perceived images

When observation noises for image pixels are assumed to follow a Gaussian distribution with zero mean and diagonal covariance, the likelihood function of visual images is


where the mean and covariance are nonlinear functions of the latent variables. To allow the second moment of the data to be captured by the density model, we choose these nonlinear functions to be deep neural networks (DNNs), which we refer to as the generative network, parameterized by . Here the DNNs can be multi-layered perceptrons (MLPs) or convolutional neural networks (CNNs). Compared with a linear observation model, DNNs can extract nonlinear features from visual images and capture the stages of human visual processing from early visual areas towards the ventral stream [Güçlü and van Gerven2015, Cichy et al.2016]. This nonlinearity and deep structure endow our model with strong representation ability.

3.2 Generative model for fMRI activity patterns

fMRI voxels are generally highly correlated, and the correlation can carry relevant information about stimuli or tasks, even in the absence of information in individual voxels [Yamashita et al.2008, Hossein-Zadeh and others2016]. However, most existing methods [Fujiwara et al.2013, Schoenmakers et al.2013] simply assume a spherical or diagonal covariance for the Gaussian distribution of voxel activities, thus ignoring any correlations among voxels. Unlike them, we assume the observation noises of voxel activities follow a Gaussian distribution with zero mean and full covariance matrix. While this difference might seem minor, it is critical for the model to be able to suppress noise and improve prediction performance. In addition, although nonlinear transformations for fMRI activity patterns are more powerful than linear transformations (in terms of the types of features they can learn), extant multi-voxel pattern analysis (MVPA) studies have not found a clear performance benefit for nonlinear versus linear transformations. Therefore, we assume the likelihood function of fMRI activity patterns is


The model should be further complemented with priors for the projection matrix and the covariance matrix . Popular choices would be automatic relevance determination (ARD) prior and Wishart distribution for and , respectively,



denotes gamma distribution with shape parameter

and rate parameter , and

denote the scale matrix and degrees of freedom for Wishart distribution, respectively.

Figure 1: Graphical models for DGMM. (a) Directly uses the full covariance matrix . (b) Imposing a low-rank assumption on ().

While the above model has the advantage of capturing the correlations among voxels, it results in severe computational issues (the cost is cubic as a function of ). Fortunately, the problem of inferring high-dimensional covariance matrix can be solved by introducing auxiliary latent variables [Archambeau and Bach2009],


and rewriting the likelihood function in Eq.(3) as


where ARD prior and simple gamma prior can be set for the extra projection matrix

and variance parameter

, respectively,


The graphical models of DGMM are shown in Fig.1. Note that sparsity of the projection matrices and can be tuned by assigning suitable values to the hyper-parameters and , respectively.

By integrating out the auxiliary latent variables , Eq.(6) can be shown to be equivalent to imposing a low-rank assumption on the covariance matrix in Eq.(3) (), which reduces the computational complexity. From another perspective, this low-rank assumption produces a full factorization of the variation in fMRI data into shared components and private components . The ability to identify what is shared and what is non-shared makes our model good at suppressing noise and improving prediction performance.
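The computational benefit of a low-rank assumption can be illustrated numerically. The sketch below assumes a covariance of the form "low-rank plus isotropic", C = HᵀH + I/γ, where `H` (shape q × d, with q much smaller than the number of voxels d) and `gamma` are symbols chosen for this example rather than taken from the paper. It inverts C with the Woodbury identity, which requires solving only a q × q system instead of a d × d one.

```python
import numpy as np

def lowrank_precision(H, gamma):
    """Inverse of C = H.T @ H + I / gamma via the Woodbury identity.

    H has shape (q, d) with q << d, so only a (q, q) system is solved.
    """
    q, d = H.shape
    small = np.eye(q) + gamma * (H @ H.T)          # (q, q) matrix
    correction = H.T @ np.linalg.solve(small, H)   # (d, d), rank at most q
    return gamma * np.eye(d) - gamma**2 * correction

rng = np.random.default_rng(0)
q, d, gamma = 5, 200, 4.0
H = rng.standard_normal((q, d))

C = H.T @ H + np.eye(d) / gamma          # dense covariance, built for checking only
P_fast = lowrank_precision(H, gamma)     # Woodbury route
P_slow = np.linalg.inv(C)                # direct route, cubic in d

print(np.allclose(P_fast, P_slow, atol=1e-6))
```

In practice one would usually avoid forming the d × d matrix at all and apply the same identity directly to vectors; the dense version is kept here only so the two routes can be compared.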

As short-hand notations, all hyper-parameters in the model will be denoted by , while the priors by and the remaining variables by . Dependence on is omitted for clarity throughout the paper. Then we can get the following posterior distribution using Bayes’ rule


where is the normalization constant.

4 Variational posterior inference

Given the above generative model, exact inference is intractable. Here we formulate a mean-field variational approximate inference method to infer the latent variables and model parameters. Specifically, we assume there is a family of factorable and free-form (except for ) variational distributions

and define as a product of multivariate Gaussian distributions with diagonal covariance (we also considered conditioning the posterior distribution on both and , but did not observe an obvious performance improvement), i.e.,

where the mean and covariance are outputs of the recognition network specified by another DNN with parameters . Then the objective is to get the optimal one which minimizes the Kullback-Leibler (KL) divergence between the approximating distribution and the target posterior, i.e.,


is the space of probability distributions. Equivalently, we can also bound the marginal likelihood:


where we used the fact that KL divergence is guaranteed to be non-negative, and

Intuitively, and can be interpreted as the (negative) expected reconstruction errors of visual images and fMRI activity patterns, respectively. Maximizing this lower bound strikes a balance between minimizing reconstruction errors of two views and minimizing the KL divergence between the approximate posterior and the prior.
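The KL term in this trade-off has a closed form when the approximate posterior is a diagonal Gaussian and the prior is a standard normal, as in variational autoencoders. A minimal sketch follows, with `mu` and `log_var` as assumed names for the mean and log-variance produced by the recognition network:

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

mu = np.array([[0.0, 0.0], [1.0, -1.0]])
log_var = np.zeros_like(mu)          # unit variance in every dimension
kl = kl_diag_gaussian_to_std_normal(mu, log_var)
print(kl)  # [0. 1.]
```

The first row matches the prior exactly and incurs no penalty; the second is penalised only for its shifted mean.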

4.1 Learning , and

Given the fixed-form approximate posterior distribution for factor , can be computed exactly as:

On the other hand, and can be approximated by Monte-Carlo sampling[Kingma and Welling2014, Kingma et al.2014]. Instead of sampling directly from , is computed as a deterministic function of and some noise term such that has the desired distribution. Assuming we draw samples, () can be expressed as

where and denotes element-wise multiplication. Then the resulting Monte-Carlo approximations are

Finally, the parameters of the DNNs ( and ) can be obtained by optimizing the objective function (based on minibatches) using standard stochastic gradient-based optimization methods such as SGD, RMSprop or AdaGrad [Duchi et al.2011].
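The sampling step described in this subsection can be sketched as follows. The function names, the variables `mu` and `sigma`, and the test function are assumptions for illustration. NumPy does not compute gradients, so the sketch shows only how samples are formed as a deterministic function of the variational parameters and independent noise, and how an expectation is then approximated by averaging.

```python
import numpy as np

def reparameterize(mu, sigma, n_samples, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); returns shape (n_samples, k)."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + sigma * eps  # element-wise product, broadcast over samples

def mc_expectation(f, mu, sigma, n_samples, rng):
    """Monte Carlo estimate of E_q[f(z)] as the average of f over the samples."""
    z = reparameterize(mu, sigma, n_samples, rng)
    return np.mean([f(z_l) for z_l in z], axis=0)

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 0.1])
estimate = mc_expectation(lambda z: z**2, mu, sigma, n_samples=20000, rng=rng)
print(estimate)  # close to mu**2 + sigma**2 = [1.25, 4.01]
```

Because the randomness enters only through `eps`, an automatic-differentiation framework can propagate gradients through `mu` and `sigma`, which is what makes stochastic gradient training of the recognition network possible.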

4.2 Learning and

For a specific factor (except for ), it can be shown that when keeping all other factors fixed the optimal distribution satisfies

For our model, thanks to the conjugacy, the resulting optimal distribution of each factor follows the same distribution as the corresponding factor.
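For reference, the textbook form of this coordinate-wise mean-field update is sketched below. The symbols are assumptions introduced for illustration: $\Theta_j$ is the factor being updated, $\Theta_{\setminus j}$ collects all other factors, and $X$, $Y$ stand for the image and fMRI views.

```latex
\log q^{*}(\Theta_j)
  = \mathbb{E}_{q(\Theta_{\setminus j})}\!\left[ \log p(X, Y, \Theta) \right]
  + \mathrm{const}
```

With conjugate priors, the right-hand side has the same functional form as the log of the corresponding prior, so each update reduces to recomputing that distribution's parameters.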

The optimal distributions of the projection parameters can be found as a product of multivariate Gaussian distributions:


where notation denotes the expectation operator, i.e., means the expectation of over its current optimal distribution, and

The optimal distribution of the auxiliary latent variables can also be found as a product of multivariate Gaussian distributions:



The optimal distributions of the precision variables can be formulated as:


where .

4.3 Convergence

The inference mechanism sequentially updates the optimal distributions of the latent variables and the model parameters until convergence, which is guaranteed because the KL divergence is convex with respect to each of the factors.

4.4 Prediction

Using the estimated parameters, we can derive the predictive distribution for a visual image given a new brain activity . The predictive distribution can be formulated as follows,


where the posterior distribution of latent variables can be derived by


The posterior distribution can be equivalently obtained by solving the following information theoretical optimization problem:


Expanding Eq.(15) and ignoring the term unrelated to , we further get

To ensure the latent representations of testing instances are close to those of their neighbours from the training set, we adopt the posterior regularization strategy [Zhu et al.2014] to incorporate manifold regularization into the above posterior predictive distribution. Specifically, we define the following expected manifold regularization:

where is some similarity measure of instances and . Here we use a k-nearest neighbor graph to effectively model local geometry structure in the input space and the affinity graph is defined as:

where denotes the k-nearest neighbors of .
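A k-nearest-neighbour affinity of this kind can be sketched as follows. The heat-kernel weight, its bandwidth `t`, the function name, and the choice to measure distances between fMRI patterns are all assumptions made for this example; the original definition of the similarity measure is not reproduced here.

```python
import numpy as np

def knn_affinity(y_test, Y_train, k, t=1.0):
    """Affinity between one test pattern and every training pattern.

    Entries are exp(-||y_test - y_j||^2 / t) for the k nearest training
    patterns and 0 elsewhere. The heat-kernel weight is an assumption.
    """
    sq_dist = np.sum((Y_train - y_test) ** 2, axis=1)
    nearest = np.argsort(sq_dist)[:k]            # indices of the k nearest neighbours
    s = np.zeros(Y_train.shape[0])
    s[nearest] = np.exp(-sq_dist[nearest] / t)
    return s

rng = np.random.default_rng(0)
Y_train = rng.standard_normal((50, 20))   # 50 training patterns, 20 voxels
y_test = Y_train[7] + 0.01 * rng.standard_normal(20)
s = knn_affinity(y_test, Y_train, k=5, t=20.0)
print(np.count_nonzero(s), int(np.argmax(s)))
```

Only the k nearest training instances receive non-zero weight, so the regulariser pulls the test instance's latent representation towards those of a small local neighbourhood rather than the whole training set.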

Then our posterior regularization strategy can be formulated as


where the parameter controls the expected scale. As a direct way to impose constraints and incorporate knowledge in Bayesian models, posterior regularization is more natural and general than specially designed priors. However, directly solving Eq.(4.4) with is difficult and inefficient. Let

then Eq.(4.4) can be rewritten as


Solving problem Eq.(4.4), we can get the posterior distribution


Because the multiple integral over the random variables

, , and is intractable, we replace the random variables , and with the mean of estimated optimal distributions and , respectively, to vanish the integral over , and . Then becomes


Now the posterior distribution can be found as:



However, with the likelihood of the visual image formulated by a DNN, the integral over the latent variables (Eq.(13)) cannot be computed analytically. As in the training phase, we can approximate this integral by Monte-Carlo sampling. Finally, the reconstructed visual image is calculated by taking the mean of all predictions, i.e., , where is the output of the generative network, i.e., .
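The final averaging step can be sketched as below. The decoder here is a toy stand-in with random weights, and `z_mean` and `z_cov` are assumed names for the mean and covariance of the Gaussian posterior over the latent variables; none of these come from the paper's implementation.

```python
import numpy as np

def reconstruct_image(decoder_mean, z_mean, z_cov, n_samples, rng):
    """Average decoder outputs over samples from a Gaussian posterior on z."""
    z_samples = rng.multivariate_normal(z_mean, z_cov, size=n_samples)
    predictions = np.stack([decoder_mean(z) for z in z_samples])
    return predictions.mean(axis=0)

# Toy stand-in for a trained generative network (hypothetical weights).
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 100))
decoder_mean = lambda z: 1.0 / (1.0 + np.exp(-(z @ W)))  # pixel means in (0, 1)

x_hat = reconstruct_image(decoder_mean, z_mean=np.zeros(3), z_cov=0.1 * np.eye(3),
                          n_samples=200, rng=rng)
print(x_hat.shape)
```

Averaging over posterior samples, rather than decoding only the posterior mean, accounts for the remaining uncertainty in the latent representation inferred from the brain activity.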

5 Experiments

In this section, we present extensive experimental results on fMRI recording datasets to demonstrate the effectiveness of the proposed framework for perceived image reconstruction from human brain activity. Specifically, we compare our DGMM with the following algorithms, which use either a shallow or a deep architecture:

  • (Miyawaki et al.): a specially designed method to reconstruct visual images by combining local image bases of multiple scales (, and pixels covering an entire image) [Miyawaki et al.2008]. The shapes of these predefined image bases are fixed, so they may not be optimal for image reconstruction.

  • (BCCA): a probabilistic extension of CCA model that relates the fMRI activity space to the visual image space via a set of latent variables [Fujiwara et al.2013]. BCCA assumes a linear observation model for visual images and a spherical covariance for the Gaussian distribution of fMRI voxels.

  • (DCCAE): a recent deep multi-view representation learning model that consists of two autoencoders and optimizes the combination of canonical correlation between the learned bottleneck representations and the reconstruction errors of the autoencoders [Wang et al.2015]. DCCAE does not consider the cross-reconstruction errors between two views.

  • (De-CNN): a recent neural decoding method based on multivariate linear regression and a deconvolutional neural network [Haiguang Wen and Liu2016, Zeiler et al.2011]. It is a two-stage cascade model, i.e., it first predicts feature maps by multivariate linear regression, then reconstructs images by feeding the estimated feature maps into a pre-trained deconvolutional neural network.

5.1 Experimental testbed and setup

Data description.  We conducted experiments on three public fMRI datasets obtained from Miyawaki et al. [Miyawaki et al.2008] and van Gerven et al. [Van Gerven et al.2010b, Schoenmakers et al.2013]. Dataset 1, consisting of contrast-defined patches, contains two independent sessions [Miyawaki et al.2008]. One is a ‘random image session’, in which spatially random patterns were sequentially presented. The other is a ‘figure image session’, in which alphabetical letters and simple geometric shapes were sequentially presented. We used fMRI data from primary visual area V1 of subject 1 (S1) for the analysis. Note that all compared algorithms were trained on the data from the ‘random image session’ and evaluated on the data from the ‘figure image session’. Dataset 2 contains a hundred handwritten gray-scale digits (equal numbers of 6s and 9s) at a 28 × 28 pixel resolution taken from the training set of the MNIST database and the fMRI data from V1, V2 and V3 [Van Gerven et al.2010b]. Dataset 3 contains 360 gray-scale handwritten characters (equal numbers of Bs, Rs, As, Is, Ns, and Ss) at a pixel resolution taken from [Van der Maaten2009] and the fMRI data of V1 and V2 taken from three subjects [Schoenmakers et al.2013]. The visual images were downsampled from pixels to pixels in our experiments. The details of the three data sets used in our experiments are summarized in Table 1. See [Miyawaki et al.2008, Van Gerven et al.2010b, Schoenmakers et al.2013] for more information, including fMRI data acquisition and preprocessing.

Datasets    #Instances  #Pixels  #Voxels  ROIs        #Training
Dataset 1   1400        100      797      V1          1320
Dataset 2   100         784      3092     V1, V2, V3  90
Dataset 3   360         784      2420     V1, V2      330
Table 1: The details of the 3 data sets used in our experiments.

Voxel selection.  Voxel selection is an important component of fMRI brain decoding because many voxels may not respond to the visual stimulus. A common approach is to choose those voxels that are maximally correlated with the visual images during training. We chose voxels for which the model provided better predictability (encoding performance). This codifies our intuition that the voxels better predicted from the visual images are those to be included in the decoding model. The goodness-of-fit between model predictions and measured voxel activities was quantified using the coefficient of determination (R²), which indicates the percentage of variance that is explained by the model. In experiments, we first computed the R² of each voxel using 10-fold cross-validation on the training data; voxels with positive R² were then selected for further analysis.
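The selection rule can be sketched as follows. The encoding model is not specified in this description, so the ridge regression from image pixels to voxel responses, the regularisation strength, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def cv_r2_per_voxel(X, Y, n_folds=10, ridge=1.0):
    """Cross-validated R^2 of a ridge encoding model (images -> voxels), per voxel."""
    n = X.shape[0]
    folds = np.array_split(np.arange(n), n_folds)
    Y_pred = np.zeros_like(Y, dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        X_tr, Y_tr = X[train_idx], Y[train_idx]
        # Ridge solution (X'X + ridge * I)^-1 X'Y, fitted on the training folds only.
        A = X_tr.T @ X_tr + ridge * np.eye(X.shape[1])
        W = np.linalg.solve(A, X_tr.T @ Y_tr)
        Y_pred[test_idx] = X[test_idx] @ W
    ss_res = np.sum((Y - Y_pred) ** 2, axis=0)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))                    # 200 images, 10 pixel features
signal = X @ rng.standard_normal((10, 3))             # 3 stimulus-driven voxels
noise = rng.standard_normal((200, 3))                 # 3 pure-noise voxels
Y = np.hstack([signal + 0.1 * rng.standard_normal((200, 3)), noise])

r2 = cv_r2_per_voxel(X, Y)
selected = np.flatnonzero(r2 > 0)                     # keep voxels with positive R^2
print(selected)
```

Because R² is computed on held-out folds, a voxel that the encoding model cannot predict tends to score at or below zero and is dropped, while stimulus-driven voxels score close to one.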

Parameter setting.  The hyper-parameters of the proposed DGMM were set to for all data sets, while 5-fold cross validation was conducted on training sets to choose better regularization parameters from . For fair comparison, the model parameters of the other methods were also tuned carefully. In our experiments, we considered multilayer perceptrons (MLPs) as the type of recognition models. Inspired by the selectivity of visual areas to feature maps of varying complexity [Güçlü and van Gerven2015, Haiguang Wen and Liu2016], we set the structures of the recognition network for visual images as ‘100-200’, ‘784-256-128-10’ and ‘784-256-128-5’ for the three data sets, respectively. In particular, we considered two types of structures for DCCAE. One has an asymmetric shape (same setup as our model for the image view and a single-layer setup for the fMRI view, DCCAE-A), which can mimic our model in structure and function. The other has a symmetric shape (same setup for both views, DCCAE-S), which can explore deep nonlinear maps for fMRI data.
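As an illustration of what a structure such as ‘784-256-128-10’ means, the sketch below builds an MLP recognition network that maps a 784-pixel image to the mean and log-variance of a 10-dimensional Gaussian over the latent variables. The tanh activation, the random initialisation, and the two separate output heads are assumptions made for the example; the text does not specify them.

```python
import numpy as np

def init_layers(sizes, rng):
    """Random (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def recognition_network(x, hidden, mean_head, log_var_head):
    """Map images to the mean and log-variance of a diagonal Gaussian."""
    h = x
    for W, b in hidden:
        h = np.tanh(h @ W + b)
    W_m, b_m = mean_head
    W_v, b_v = log_var_head
    return h @ W_m + b_m, h @ W_v + b_v

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]                     # the '784-256-128-10' structure
hidden = init_layers(sizes[:-1], rng)           # 784 -> 256 -> 128
mean_head = init_layers(sizes[-2:], rng)[0]     # 128 -> 10
log_var_head = init_layers(sizes[-2:], rng)[0]  # 128 -> 10

x = rng.random((4, 784))                        # a batch of 4 flattened images
mu, log_var = recognition_network(x, hidden, mean_head, log_var_head)
print(mu.shape, log_var.shape)
```

Under this reading, the last number in each structure string is the dimensionality of the latent representation, which is smallest for the data sets with the fewest training instances.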

5.2 Performance evaluation

The geometric shapes and alphabet letters, handwritten digits, and handwritten characters reconstructed by the proposed DGMM and the other algorithms are shown in Fig.2, Fig.3 and Fig.4, respectively, where the first row shows the presented images and the rows below show the reconstructions obtained from each compared algorithm.

Figure 2: Image reconstructions of geometric shapes and alphabet letters taken from Dataset 1.
Figure 3: Image reconstructions of 10 distinct handwritten digits taken from Dataset 2.
Figure 4: Examples of reconstructed 18 distinct handwritten characters taken from subject 3 of Dataset 3.

Overall, the images reconstructed by DGMM captured the essential features of the presented images. In particular, they showed fine reconstructions for handwritten digits and characters. Although the reconstructed geometric shapes and alphabet letters had some noise in the peripheral regions, the main shapes can be clearly distinguished. Although the reconstructions of handwritten digits and characters share certain characteristics with their corresponding original images, there are subtle differences in the strokes. We attribute this to the fact that the manifold regularization imposed on the latent representations may change the details of the reconstructed images. In contrast, images reconstructed by Miyawaki’s method and BCCA were coarse for all image types, with noise scattered over the entire reconstructed image. Also, both DCCAE-S and DCCAE-A produced disappointing reconstructions that often lacked the shapes of the presented images, especially for geometric shapes and alphabet letters. This might be because nonlinear maps easily overfit the voxel activities.

To evaluate the reconstruction performance quantitatively, we used several standard image similarity metrics, including Pearson’s correlation coefficient (PCC), mean squared error (MSE) and structural similarity index (SSIM) [Wang et al.2004]. Note that MSE is not highly indicative of perceived similarity, while SSIM addresses this shortcoming by taking texture into account. In addition, we performed image classification analysis to quantify the reconstruction accuracy from another perspective. Specifically, a linear support vector machine (SVM) and a convolutional neural network (CNN) trained on the presented visual images were used as classifiers to label the reconstructed images. The classification accuracies of the SVM (ACC-SVM) and the CNN (ACC-CNN) on the reconstructed images are reported. Performance comparisons are listed in Table 2. Note that we also list the time consumed in the training phase for all compared algorithms in the last column for reference. Several observations can be drawn as follows.

Datasets   Algorithms       PCC        MSE        SSIM       ACC-SVM    ACC-CNN    Time(s)
Dataset 1  Miyawaki et al.  .609±.151  .162±.025  .237±.105  -          -          19.4±1.1
           BCCA             .438±.215  .253±.051  .181±.066  -          -          74.9±3.0
           DCCAE-A          .455±.113  .234±.029  .166±.025  -          -          211.8±7.5
           DCCAE-S          .401±.100  .240±.027  .175±.011  -          -          254.9±9.8
           De-CNN           .469±.149  .263±.067  .224±.129  -          -          108.2±2.2
           DGMM             .611±.183  .159±.112  .268±.106  -          -          118.4±2.5
Dataset 2  Miyawaki et al.  .767±.033  .042±.007  .466±.030  1.00       1.00       39.9±1.2
           BCCA             .411±.157  .119±.017  .192±.035  1.00       1.00       20.7±1.0
           DCCAE-A          .548±.044  .074±.010  .358±.097  .900       .967±.047  12.7±0.3
           DCCAE-S          .511±.057  .080±.016  .552±.088  1.00       1.00       19.4±0.8
           De-CNN           .799±.062  .038±.010  .613±.043  1.00       1.00       35.8±1.2
           DGMM             .803±.063  .037±.014  .645±.054  1.00       1.00       18.6±1.2
Dataset 3  Miyawaki et al.  .481±.096  .067±.026  .191±.043  .655±.193  .655±.113  128.1±4.6
           BCCA             .348±.138  .128±.049  .058±.042  .633±.034  .600±.098  32.9±1.0
           DCCAE-A          .354±.167  .073±.036  .186±.234  .478±.126  .533±.072  38.1±1.1
           DCCAE-S          .351±.153  .086±.031  .179±.117  .478±.051  .478±.155  59.5±1.8
           De-CNN           .470±.149  .084±.035  .322±.118  .589±.135  .611±.128  96.8±2.0
           DGMM             .498±.193  .058±.031  .340±.051  .767±.115  .778±.083  42.4±4.2
Table 2: Performance of several image reconstruction methods on the test sets. Results were averaged over 20 random seeds and all subjects (mean±std). The best performance on each dataset was highlighted.
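The PCC and MSE columns of Table 2 correspond to per-image quantities that can be computed along the following lines. This is a minimal sketch on synthetic arrays, with function names chosen for the example; SSIM is omitted because it is more involved, and a library routine such as `structural_similarity` from scikit-image's `skimage.metrics` module would normally be used for it.

```python
import numpy as np

def pcc(x, x_hat):
    """Pearson correlation between the flattened original and reconstruction."""
    return np.corrcoef(x.ravel(), x_hat.ravel())[0, 1]

def mse(x, x_hat):
    """Mean squared error over all pixels."""
    return np.mean((x - x_hat) ** 2)

rng = np.random.default_rng(0)
x = rng.random((28, 28))                        # stand-in for a presented image
x_hat = x + 0.1 * rng.standard_normal(x.shape)  # stand-in for a noisy reconstruction

print(round(pcc(x, x), 3), mse(x, x))
print(pcc(x, x_hat) < 1.0, mse(x, x_hat) > 0.0)
```

Higher PCC and lower MSE both indicate a closer match, which is how the corresponding columns of the table should be read.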

First, comparing DGMM against the other algorithms, we find that DGMM achieves the best PCC, MSE and SSIM on all three data sets. In particular, the SSIM values of DGMM surpass those of the baseline algorithms in all cases.

Second, comparing DGMM against BCCA, which has a linear observation model for visual images, we find that DGMM always outperforms BCCA. This encouraging result shows that DGMM, with a DNN model for visual images, is able to extract nonlinear features from visual images.

Third, DGMM performs clearly better than DCCAE-A and DCCAE-S. Besides DCCAE ignoring cross-reconstruction, this is also because a linear map between voxel activities and the bottleneck representation is enough to achieve good performance, while nonlinear maps easily overfit given the high dimensionality of the limited fMRI data instances.

Fourth, the performance of De-CNN is moderate on all data sets. We attribute this to the fact that it is a two-stage method, which cannot obtain globally optimal model parameters.

Finally, nearly correct classification is possible for each algorithm on Dataset 2. We believe this is because the digits 6 and 9 are easy to distinguish from each other. On Dataset 3, the markedly higher classification performance on the images reconstructed by our model again demonstrates the superiority of the proposed DGMM.

6 Conclusion and future works

We have proposed a deep generative multiview framework to tackle the perceived image reconstruction problem. In our framework, multiple correspondences between visual image pixels and fMRI voxels can be found via a set of latent variables. We also derived a predictive distribution that succeeded in reconstructing visual images from brain activity patterns. Although we focused on the visual image reconstruction problem in this paper, our framework can also deal with brain encoding tasks. Extensive experimental studies have confirmed the superiority of the proposed framework.

Two challenging and promising directions can be considered in the future. First, by incorporating recurrent neural networks (RNNs) [Chung et al.2015] into our framework, we can explore the reconstruction of dynamic vision. Second, by treating each subject’s fMRI measurements as one view, we can explore multi-subject decoding.


This work was supported by National Natural Science Foundation of China (No. 91520202, 61602449) and Youth Innovation Promotion Association CAS.


  • [Archambeau and Bach2009] Cédric Archambeau and Francis R Bach. Sparse probabilistic projections. In NIPS, pages 73–80, 2009.
  • [Chandar et al.2016] Sarath Chandar, Mitesh M Khapra, Hugo Larochelle, and Balaraman Ravindran. Correlational neural networks. Neural computation, 2016.
  • [Chung et al.2015] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NIPS, pages 2980–2988, 2015.
  • [Cichy et al.2016] Radoslaw Martin Cichy, Aditya Khosla, Dimitrios Pantazis, Antonio Torralba, and Aude Oliva. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific reports, 6, 2016.
  • [Damarla and Just2013] Saudamini Roy Damarla and Marcel Adam Just. Decoding the representation of numerical values from brain activation patterns. Human brain mapping, 34(10):2624–2634, 2013.
  • [Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • [Elahe’Yargholi2016] Elahe’ Yargholi and Gholam-Ali Hossein-Zadeh. Brain decoding-classification of hand written digits from fmri data employing bayesian networks. Frontiers in Human Neuroscience, 10, 2016.
  • [Fujiwara et al.2013] Yusuke Fujiwara, Yoichi Miyawaki, and Yukiyasu Kamitani. Modular encoding and decoding models derived from bayesian canonical correlation analysis. Neural computation, 25(4):979–1005, 2013.
  • [Güçlü and van Gerven2015] Umut Güçlü and Marcel A. J. van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015.
  • [Haiguang Wen and Liu2016] Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, and Zhongming Liu. Neural encoding and decoding with deep learning for dynamic natural vision. arXiv:1608.03425v1, 2016.
  • [Hossein-Zadeh and others2016] Gholam-Ali Hossein-Zadeh et al. Reconstruction of digit images from human brain fmri activity through connectivity informed bayesian networks. Journal of neuroscience methods, 257:159–167, 2016.
  • [Kingma and Welling2014] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
  • [Kingma et al.2014] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, pages 3581–3589, 2014.
  • [Lee and Kuhl2016] Hongmi Lee and Brice A Kuhl. Reconstructing perceived and retrieved faces from activity patterns in lateral parietal cortex. The Journal of Neuroscience, 36(22):6069–6082, 2016.
  • [Miyawaki et al.2008] Yoichi Miyawaki, Hajime Uchida, Okito Yamashita, Masa-aki Sato, Yusuke Morito, Hiroki C Tanabe, Norihiro Sadato, and Yukiyasu Kamitani. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5):915–929, 2008.
  • [Ng and Abugharbieh2011] Bernard Ng and Rafeef Abugharbieh. Generalized group sparse classifiers with application in fmri brain decoding. In CVPR, pages 1065–1071, 2011.
  • [Nishimoto et al.2011] Shinji Nishimoto, An T Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, and Jack L Gallant. Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology, 21(19):1641–1646, 2011.
  • [Schoenmakers et al.2013] Sanne Schoenmakers, Markus Barth, Tom Heskes, and Marcel van Gerven. Linear reconstruction of perceived images from human brain activity. NeuroImage, 83:951–961, 2013.
  • [Schoenmakers et al.2015] Sanne Schoenmakers, Umut Güçlü, Marcel Van Gerven, and Tom Heskes. Gaussian mixture models and semantic gating improve reconstructions from human brain activity. Frontiers in computational neuroscience, 8, 2015.
  • [Van der Maaten2009] Laurens Van der Maaten. A new benchmark dataset for handwritten character recognition. Tilburg University, pages 2–5, 2009.
  • [Van Gerven et al.2010a] Marcel AJ Van Gerven, Botond Cseke, Floris P De Lange, and Tom Heskes. Efficient bayesian multivariate fmri analysis using a sparsifying spatio-temporal prior. NeuroImage, 50(1):150–161, 2010.
  • [Van Gerven et al.2010b] Marcel AJ Van Gerven, Floris P De Lange, and Tom Heskes. Neural decoding with hierarchical generative models. Neural computation, 22(12):3127–3142, 2010.
  • [Wang et al.2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [Wang et al.2015] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In ICML, pages 1083–1092, 2015.
  • [Yamashita et al.2008] Okito Yamashita, Masa-aki Sato, Taku Yoshioka, Frank Tong, and Yukiyasu Kamitani. Sparse estimation automatically selects voxels relevant for the decoding of fmri activity patterns. NeuroImage, 42(4):1414–1429, 2008.
  • [Zeiler et al.2011] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, pages 2018–2025, 2011.
  • [Zhu et al.2014] Jun Zhu, Ning Chen, and Eric P Xing. Bayesian inference with posterior regularization and applications to infinite latent svms. Journal of Machine Learning Research, 15(1):1799–1847, 2014.