1 Introduction
Recently, a paper published in Cell [Chang and Tsao2017] reported an interesting discovery about the neurons in the middle lateral (ML)/middle fundus (MF) and anterior medial (AM) areas of the primate brain that are responsible for face recognition. Specifically, the paper is concerned with how these neurons respond to and encode face stimuli generated by a pretrained Active Appearance Model (AAM) [Cootes et al.2001, Cootes et al.2015]. In the AAM, explicit shape variables and appearance variables generate the positions of the control points and the nominal face image respectively, and the output image is then generated by warping the nominal face image using the control points. [Chang and Tsao2017] discovered that the responses of the aforementioned neurons to a face image generated by the AAM exhibit a strong linear relationship with the shape and appearance variables of the AAM that generates the face image. In fact, the shape and appearance variables of the AAM can be recovered from the neuron responses, so that the face image can be reconstructed by the AAM using the recovered shape and appearance variables.
In this paper, we investigate whether the above phenomenon can be replicated by deep generative models. In particular, we focus on a popular deep generative model called the generator network [Goodfellow et al.2014], which can be considered a nonlinear generalization of the factor analysis model. Recall that in the factor analysis model, the signal is generated by latent factors that are assumed to be independent Gaussian random variables, and the signal is a linear transformation of the latent variables (plus observational noise). In the generator network, the latent variables still follow a simple known prior distribution such as an independent Gaussian or uniform distribution, but the mapping from the latent variables to the observed signal is modeled by a convolutional neural network (ConvNet), which has proven to be an exceedingly powerful approximator of high-dimensional nonlinear mappings.
Both the AAM and the generator network are latent variable models where the signal is obtained by transforming the latent variables. In the AAM, the latent variables consist of explicit shape variables and appearance variables, which generate the control points and the appearance image by linear mappings learned by principal component analysis (PCA). The output image is generated by a highly nonlinear but known warping function of the control points and the nominal image. In contrast, the generator network is more generic, in that it does not assume any prior knowledge about shape and deformation, and it does not have any explicit shape variables or shape model. We are interested in whether the generator network can replicate the AAM, in the sense that the generator network can learn from the images generated by a pretrained AAM so that the latent variables of the learned generator network are closely related to the latent shape and appearance variables of the AAM, and the nonlinear mapping from the latent variables to the output image in the generator network accounts for the highly nonlinear warping function of the AAM. As it is impossible for the latent variables of the learned generator network to be the same as the latent variables of the AAM, a strong linear relationship between the two sets of latent variables (or codes) is the best we can hope for. We shall show that such a linear relationship indeed exists, thus qualitatively reproducing the behavior of the neuron responses (or neural code) observed by [Chang and Tsao2017].

The generator network can be trained by various methods, including the wake-sleep algorithm [Hinton et al.1995], variational autoencoder (VAE) [Kingma and Welling2014, Rezende et al.2014, Salimans et al.2015], generative adversarial networks (GAN) [Goodfellow et al.2014, Radford et al.2016, Denton et al.2015], moment matching networks [Li et al.2015], alternating back-propagation (ABP) [Han et al.2017], and other related methods [Oord et al.2016, Dinh et al.2016]. They have led to impressive results in a wide range of applications, such as image/video synthesis [Dosovitskiy et al.2015], disentangled feature learning [Chen et al.2016, Higgins et al.2016] and pattern completion [Han et al.2017].

In this paper, we shall adopt the VAE method to train the generator network. Unlike GAN, the VAE complements the generator network with an inference network that transforms the observed image to the latent variables. The inference network seeks to approximate the posterior distribution of the latent variables given the observed image. The inference network and the generator network form an autoencoder, where the inference network plays the role of the encoder that encodes the signal into the latent variables (or latent code), and the generator network plays the role of the decoder that decodes the latent variables (or latent code) back to the signal. The parameters of the two networks can be learned by maximizing a variational lower bound of the log-likelihood [Blei et al.2017]. We show that the latent variables computed by the inference network from the observed face image are highly correlated with the latent variables of the AAM that generates the face image.
Contributions.
This paper is phenomenological in nature. It is our hope that the paper is of interest to both the neuroscience community and the deep learning community. The following are the contributions of this paper:

We study the linear relationship between the latent code learned by the generator network and the AAM code that generates the face stimuli. Our experiments suggest that the deep generative model exhibits behavior similar to that of the primate neural system.

We show that the latent variables learned by the generator network can be separated into a shape-related part and an appearance-related part, and that the generator network is expressive enough to replicate the AAM.
2 Active Appearance Model (AAM)
The active appearance model [Cootes et al.2001, Cootes et al.2015] is a generative model for representing face images. It has a shape model and an appearance model. Both models are learned by principal component analysis (PCA).
Shape model: The shape model is based on a set of landmarks or control points. In the training stage, the control points are given for each training image. Let $x$ denote the coordinates of all the control points. The shape model is

$$x = \bar{x} + P_s b_s, \tag{1}$$

where $\bar{x}$ is the average shape, $P_s$ is the matrix of eigenvectors, and $b_s$ is the vector of shape variables. $(\bar{x}, P_s)$ are shared across all the training examples, while $x$ and $b_s$ are different for different examples. The model can be learned from the given control points of the training images by PCA, where the number of eigenvectors is determined empirically.

Appearance model: The appearance model generates the nominal image before shape deformation. To learn the model, we can warp each training image to the shape-normalized image so that its control points match those of the mean shape $\bar{x}$. Then a PCA is performed on the shape-normalized training images. Let $g$ denote the vector of the grey-level image. The appearance model is

$$g = \bar{g} + P_g b_g, \tag{2}$$

where $\bar{g}$ is the mean normalized grey-level image, $P_g$ is the matrix of eigenvectors, and $b_g$ is the vector of appearance variables. $(\bar{g}, P_g)$ are shared by all the training examples, while $g$ and $b_g$ are different for different examples.

We can learn $(\bar{x}, P_s, \bar{g}, P_g)$ from the training images with given control points. We concatenate the shape and appearance variables to form the face representation or the latent code, i.e., $b = (b_s, b_g)$. Given $b$, we can generate a face image by generating $x$ and $g$ first, and then warping $g$ according to $x$ using a warping function $\Phi$ to output the image $Y = \Phi(g, x)$. The warping function is given and is highly nonlinear in $x$ and $g$.
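The two linear models in Eqns (1) and (2) share the same PCA machinery, which can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the AAM implementation used in the experiments: the function names (`fit_pca_model`, `encode`, `decode`), the variable names `x_bar`, `P_s`, `b_s`, and the toy landmark data are all hypothetical, and the nonlinear warping step is omitted.

```python
import numpy as np

def fit_pca_model(X, n_components):
    """Fit the mean and top principal directions of the rows of X (n_samples x dim)."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are orthonormal principal directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_components].T  # dim x n_components, orthonormal columns
    return mean, P

def encode(v, mean, P):
    """Project a vector onto the PCA basis: b = P^T (v - mean)."""
    return P.T @ (v - mean)

def decode(b, mean, P):
    """Linear generation as in Eqns (1) and (2): v = mean + P b."""
    return mean + P @ b

# Hypothetical toy data: 200 "shapes" of 5 landmarks (10 coordinates each)
# that lie exactly in a 3-dimensional affine subspace.
rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 10))
shapes = rng.normal(size=(200, 3)) @ basis + rng.normal(size=10)

x_bar, P_s = fit_pca_model(shapes, n_components=3)
b_s = encode(shapes[0], x_bar, P_s)   # shape variables of the first example
x_rec = decode(b_s, x_bar, P_s)       # reconstruction from the shape variables
```

Because the toy shapes lie in a 3-dimensional subspace, three components reconstruct them up to numerical precision; with real landmarks the number of retained components trades off compactness against reconstruction error.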
3 Generator Network
The generator network is a deep generative model of the following form:

$$Y = f_\theta(z) + \epsilon, \tag{3}$$

$$z \sim \mathrm{N}(0, I_d), \tag{4}$$

where $z$ is the vector of latent variables (or latent code), and $d$ is the dimension of $z$, i.e., the number of latent variables. $z$ is assumed to follow a simple prior distribution where each component is a $\mathrm{N}(0, 1)$ Gaussian random variable ($I_d$ is the $d$-dimensional identity matrix). The latent vector $z$ generates the output image $Y$ by a nonlinear mapping $f_\theta$, which is modeled by a top-down convolutional neural network (ConvNet), where $\theta$ collects all the weight and bias parameters of the top-down ConvNet. $\epsilon$ is the noise vector whose elements are independent $\mathrm{N}(0, \sigma^2)$ random variables. Even though $z$ follows a simple distribution, the model can generate $Y$ with a very complex distribution and with very rich patterns because of the expressiveness of $f_\theta$. The generator model (3) is a generalization of the factor analysis model, where the mapping from $z$ to $Y$ is assumed linear.

Compared to the AAM, the generator network has no explicit shape model such as (1) with control points $x$ and shape variables $b_s$, nor does it have the explicit nonlinear warping function $\Phi$. The generator network relies on the highly expressive ConvNet $f_\theta$ to account for the linear shape model and the nonlinear warping function. Even though no prior knowledge of shape and warping is built into the generator network, it can learn such knowledge by itself.

Specifically, we shall use a pretrained AAM as a teacher model, and we let the generator network be the student model. The AAM generates training images, and the generator network learns from the training images. We shall show that the $z$ inferred from the face image $Y$ has a strong linear relationship with the corresponding AAM code $b$ that the AAM uses to generate $Y$.
4 Variational Autoencoder (VAE)
Given a set of training images $\{Y_i, i = 1, \ldots, n\}$ generated by the AAM, we train the generator network by variational autoencoder (VAE) [Kingma and Welling2014, Rezende et al.2014, Salimans et al.2015]. Let $p(z)$ be the prior distribution of $z$. Let $p_\theta(Y|z)$ be the conditional distribution of the image $Y$ given the latent vector $z$. Then the marginal distribution of $Y$ is $p_\theta(Y) = \int p(z) p_\theta(Y|z) dz$. The log-likelihood is $\sum_{i=1}^{n} \log p_\theta(Y_i)$, and in principle $\theta$ can be estimated by maximizing the log-likelihood. However, this is intractable because $p_\theta(Y)$ involves an intractable integral. The EM algorithm [Dempster et al.1977] is also impractical because the posterior distribution $p_\theta(z|Y)$ is intractable. The basic idea of VAE is to approximate the posterior distribution by a tractable inference model $q_\phi(z|Y)$ with a separate set of parameters $\phi$, such as a Gaussian distribution with independent components $\mathrm{N}(\mu_\phi(Y), \mathrm{diag}(\sigma^2_\phi(Y)))$, where $\mu_\phi(Y)$ is the vector of means of the components of $z$, and $\sigma^2_\phi(Y)$ is the vector of variances of the components of $z$. Both $\mu_\phi$ and $\sigma^2_\phi$ can be modeled by bottom-up ConvNets.

The parameters $(\theta, \phi)$ can be learned by jointly maximizing the variational lower bound of the log-likelihood

$$L(\theta, \phi) = \sum_{i=1}^{n} \left[ \log p_\theta(Y_i) - \mathrm{KL}\big(q_\phi(z|Y_i) \,\|\, p_\theta(z|Y_i)\big) \right], \tag{5}$$

where $\mathrm{KL}(q \| p)$ denotes the Kullback-Leibler divergence from $q$ to $p$. $L(\theta, \phi)$ is computationally tractable as long as the inference model is tractable. See [Kingma and Welling2013] for more details. $q_\phi(z|Y)$ is the encoder, and $p_\theta(Y|z)$ is the decoder. After learning $(\theta, \phi)$, we can estimate $z$ from $Y$ by the learned posterior mean vector $\mu_\phi(Y)$. In our work, we use $\mu_\phi(Y)$ as the code of $Y$.

We can understand VAE as follows. Let $q_{\rm data}(Y)$ be the data distribution. Then the maximum likelihood is equivalent to minimizing $\mathrm{KL}(q_{\rm data}(Y) \| p_\theta(Y))$ over $\theta$. VAE is equivalent to minimizing
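$$\mathrm{KL}(Q_\phi \| P_\theta) = \mathrm{KL}\big(q_{\rm data}(Y)\, q_\phi(z|Y) \,\|\, p(z)\, p_\theta(Y|z)\big) \tag{6}$$

$$= \mathrm{KL}\big(q_{\rm data}(Y) \,\|\, p_\theta(Y)\big) + \mathrm{E}_{q_{\rm data}(Y)}\big[\mathrm{KL}\big(q_\phi(z|Y) \,\|\, p_\theta(z|Y)\big)\big] \tag{7}$$

over both $\theta$ and $\phi$, where $Q_\phi(z, Y) = q_{\rm data}(Y)\, q_\phi(z|Y)$ and $P_\theta(z, Y) = p(z)\, p_\theta(Y|z)$. Unlike the maximum likelihood objective function $\mathrm{KL}(q_{\rm data}(Y) \| p_\theta(Y))$, which is the KL divergence between the marginal distributions, the variational objective function $\mathrm{KL}(Q_\phi \| P_\theta)$ is the KL divergence between the joint distributions. While the marginal distribution $p_\theta(Y)$ is intractable, the joint distribution $P_\theta(z, Y)$ is tractable.

Define $\mathcal{Q} = \{Q_\phi, \forall \phi\}$ and $\mathcal{P} = \{P_\theta, \forall \theta\}$ to be the two families of joint distributions. We can view VAE as the joint minimization of $\mathrm{KL}(Q \| P)$ over $Q \in \mathcal{Q}$ and $P \in \mathcal{P}$. Such joint minimization can be accomplished by alternating projection as illustrated by Figure 1, where $\mathcal{Q}$ and $\mathcal{P}$ are illustrated by two lines, and each distribution in $\mathcal{Q}$ and $\mathcal{P}$ is illustrated by a point. Starting from $P_{\theta_0}$, we project it onto $\mathcal{Q}$ by minimizing $\mathrm{KL}(Q_\phi \| P_{\theta_0})$ over $\phi$ to obtain $Q_{\phi_1}$. Then we project $Q_{\phi_1}$ onto $\mathcal{P}$ by minimizing $\mathrm{KL}(Q_{\phi_1} \| P_\theta)$ over $\theta$ to obtain $P_{\theta_1}$, and so on. This process will converge to a local minimum of $\mathrm{KL}(Q_\phi \| P_\theta)$. In Figure 1, the two projections are illustrated by two different colors, because they are of different natures. The projection onto $\mathcal{Q}$ is a variational projection that minimizes over the first argument of the KL divergence, while the projection onto $\mathcal{P}$ is a model fitting projection that minimizes over the second argument. As is commonly known, the former has mode seeking behavior while the latter has moment matching behavior.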
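The step from Eqn (6) to Eqn (7) follows from factoring the joint distribution of the generator in two ways. As a sketch, write $q_{\rm data}(Y)$ for the data distribution, $q_\phi(z|Y)$ for the inference model, and $p(z)\,p_\theta(Y|z) = p_\theta(Y)\,p_\theta(z|Y)$ for the generator (this notation is an assumption wherever the symbols are not spelled out in the text):

```latex
\begin{align*}
\mathrm{KL}\big(q_{\rm data}(Y)\,q_\phi(z|Y)\,\|\,p(z)\,p_\theta(Y|z)\big)
&= \mathrm{E}_{q_{\rm data}(Y)}\,\mathrm{E}_{q_\phi(z|Y)}
   \left[\log\frac{q_{\rm data}(Y)\,q_\phi(z|Y)}{p_\theta(Y)\,p_\theta(z|Y)}\right] \\
&= \mathrm{E}_{q_{\rm data}(Y)}\left[\log\frac{q_{\rm data}(Y)}{p_\theta(Y)}\right]
 + \mathrm{E}_{q_{\rm data}(Y)}\,\mathrm{E}_{q_\phi(z|Y)}
   \left[\log\frac{q_\phi(z|Y)}{p_\theta(z|Y)}\right] \\
&= \mathrm{KL}\big(q_{\rm data}(Y)\,\|\,p_\theta(Y)\big)
 + \mathrm{E}_{q_{\rm data}(Y)}\Big[\mathrm{KL}\big(q_\phi(z|Y)\,\|\,p_\theta(z|Y)\big)\Big].
\end{align*}
```

The second term is non-negative, so the joint objective upper-bounds the maximum likelihood objective, and the first term does not depend on $\phi$, so minimizing over $\phi$ only tightens the bound.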
A precursor to VAE is the wake-sleep algorithm [Hinton et al.1995], which amounts to replacing the minimization of $\mathrm{KL}(Q_\phi \| P_\theta)$ over $\phi$ by the minimization of $\mathrm{KL}(P_\theta \| Q_\phi)$ over $\phi$, i.e., switching the order of $Q_\phi$ and $P_\theta$. The minimization of $\mathrm{KL}(P_\theta \| Q_\phi)$ over $\phi$ can be accomplished by generating data from $P_\theta$ in the sleep phase and learning $\phi$ from the generated data. Because of the switched order, the wake-sleep algorithm does not have a single objective function. However, in the wake-sleep algorithm, both projections are of the model fitting type.
The generator network can also be trained by GAN [Goodfellow et al.2014]. However, GAN does not have an inference model or an encoder, which is crucial for our work.
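For a Gaussian inference model with independent components and a standard Gaussian prior, the variational lower bound can be estimated in the equivalent form $\mathrm{E}_{q}[\log p_\theta(Y|z)] - \mathrm{KL}(q_\phi(z|Y) \| p(z))$, where the KL term has a closed form. The sketch below illustrates this for a single image. It is not the training code of the paper: the function names, the linear toy decoder `W`, and the noise variance `noise_var` are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) )."""
    return 0.5 * np.sum(mu**2 + var - np.log(var) - 1.0)

def reparameterize(mu, var, rng):
    """Draw z = mu + sqrt(var) * eps with eps ~ N(0, I)."""
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

def elbo_estimate(Y, mu, var, decoder, noise_var, rng, n_samples=1):
    """Monte Carlo estimate of E_q[log p(Y|z)] - KL(q(z|Y) || p(z))."""
    D = Y.size
    log_lik = 0.0
    for _ in range(n_samples):
        z = reparameterize(mu, var, rng)
        resid = Y - decoder(z)
        # Log-density of Y under N(decoder(z), noise_var * I).
        log_lik += -0.5 * (np.sum(resid**2) / noise_var
                           + D * np.log(2.0 * np.pi * noise_var))
    return log_lik / n_samples - gaussian_kl_to_standard_normal(mu, var)

# Hypothetical toy setup: a linear "decoder" standing in for the ConvNet.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 2))
decoder = lambda z: W @ z
Y = decoder(np.array([0.5, -1.0]))
value = elbo_estimate(Y, mu=np.zeros(2), var=np.ones(2),
                      decoder=decoder, noise_var=0.1, rng=rng, n_samples=10)
```

In an actual VAE, `mu` and `var` would be the outputs of the encoder networks for the image `Y`, and the estimate would be maximized over the parameters of both networks by stochastic gradient ascent.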
5 Experiments
We conduct experiments to investigate whether the generator network can replicate or imitate the AAM, where the AAM serves as the teacher model and the generator network plays the role of the student model. In the learning stage, the generator network only has access to the images generated by the AAM. It does not have access to the shape and appearance variables (latent code) used by the AAM to generate the images. After learning the generator network, we investigate the relationship between the latent code of the learned generator network and the latent code of the AAM.
5.1 Experiment Setting
Data Generation. We pretrain the AAM using approximately 200 frontal face images with given landmarks or control points. Coordinates of the landmarks are first averaged, then PCA is performed where the first 10 principal components (PCs) for shape (see Eqn 1) are retained. The landmarks of each training image are then smoothly morphed into the average shape, so that the resulting image only carries shape-free appearance information. Another PCA is then performed on the shape-normalized training images, where the first 10 PCs for appearance (see Eqn 2) are retained. This results in a 20-dimensional latent face space, where every point represents a face. In other words, every face has a corresponding 20-dimensional AAM code denoted as $b$, which encodes its shape and appearance variables.
To generate face stimuli for our experiments, we randomly generate face images from the above pretrained AAM. Specifically, for each dimension of the latent code, we record the standard deviation of the training responses of that dimension, and sample the variable from the Gaussian distribution with the same standard deviation as the real training faces. After that, these sampled variables are combined with the learned eigenvectors $P_s$ and $P_g$ to generate the synthesized images. The obtained images are then used as our training data for the generator network. Figure 2 shows some examples of training images to pretrain the AAM, and the synthesized face images generated by the trained AAM.

VAE Training. The training images obtained above are scaled so that the intensities are within the range . The training images are also resized to to ease the computation.
For the generator network, we adopt a structure similar to [Radford et al.2016, Dosovitskiy et al.2015]. The network consists of multiple deconvolution (a.k.a. convolution-transpose) layers interleaved with ReLU nonlinearity and batch normalization [Ioffe and Szegedy2015]. Specifically, we learn a 5-layer top-down ConvNet. The first deconvolutional layer has filters with kernel size and stride . There are filters with kernel size and stride for the second, third, fourth and fifth deconvolution layers respectively. Each deconvolution layer is followed by ReLU nonlinearity and batch normalization except the last deconvolution layer, which is instead followed by the tanh nonlinearity.

For the inference model or the encoder network of VAE, we utilize the mirror structure of the generator network (which is the decoder network), where we use convolutional layers instead of deconvolutional ones. Besides, we use the ReLU with leaky factor as our nonlinearity. The mean and variance networks of the inference model share the same network structure except the top fully-connected layer. We also adopt batch normalization in the inference model as in the generator network.
We tried different dimensionalities for the latent code $z$, including 20, 100 and 200 dimensions. We used the Adam optimizer [Kingma and Ba2014] with initial learning rate for iterations. The outputs of the mean network of the inference model are used as the learned latent code and are denoted as $\hat{z}$. Realistic synthesized images can be generated by the trained generator network. See Figure 3 for some examples.
We design four experiments to examine the relationship between the AAM code $b$ for generating the face images and the code learned by the generator network, $\hat{z}$.
5.2 Linear Relationship
[Chang and Tsao2017] discovered that if a neuron has ramp-shaped tuning to different facial features, then its neural response can be approximated by a linear combination of the facial features. That is, the neural code for face patches ML/MF and AM has a linear relationship with the AAM code of the presented face stimuli. In our first experiment, we check the strength of linearity between the code learned by the generator network and the underlying AAM code.
The codes for AAM, i.e., $b$, are used to predict the corresponding codes learned by the generator network, i.e., $\hat{z}$, and vice versa. Specifically, we fit linear model A and linear model B respectively:

$$\hat{z} = A\, b + \epsilon_A, \tag{8}$$

$$b = B\, \hat{z} + \epsilon_B. \tag{9}$$

We also include intercept terms in both models. The goodness of fit of the model is determined by the percentage of variance in the data that is explained by the fitted linear model, i.e., the so-called R-square ($R^2$):

$$R^2 = 1 - \frac{\sum_{i} \| y_i - \hat{y}_i \|^2}{\sum_{i} \| y_i - \bar{y} \|^2}, \tag{10}$$

where $y_i$ is the given code for image $i$, $\hat{y}_i$ denotes the fitted value, and $\bar{y}$ is the average of the code. A higher $R^2$ indicates a stronger linear relationship.
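The two regressions and the R-square computation can be sketched with ordinary least squares. This is a minimal illustration, not the analysis code of the paper: the function names are hypothetical, and the "learned code" here is simulated as a noisy affine function of a random 20-dimensional AAM code, whereas in the experiments it comes from the trained encoder.

```python
import numpy as np

def fit_linear_with_intercept(X, Y):
    """Least-squares fit of Y ~ X W + c; returns (W, c)."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef[:-1], coef[-1]

def r_squared(Y, Y_hat):
    """Fraction of variance explained, pooled over all code dimensions."""
    ss_res = np.sum((Y - Y_hat) ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

# Simulated stand-ins for the AAM codes and the learned codes (assumption).
rng = np.random.default_rng(0)
b = rng.normal(size=(500, 20))
M = rng.normal(size=(20, 100))
z = b @ M + 3.0 + 0.01 * rng.normal(size=(500, 100))

# Model A: predict the learned code from the AAM code.
A, c_A = fit_linear_with_intercept(b, z)
r2_A = r_squared(z, b @ A + c_A)

# Model B: predict the AAM code from the learned code.
B, c_B = fit_linear_with_intercept(z, b)
r2_B = r_squared(b, z @ B + c_B)
```

Pooling the squared errors over all code dimensions gives a single $R^2$ per model; a per-dimension $R^2$ can be obtained by summing over images only.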
The $R^2$ values for different dimensionalities of $z$ are shown in the first two rows of Table 1. In addition to the convolution-based (Conv) structures of the generator network stated above, we also test the linear relationship using fully-connected (FC) structures. Specifically, we learn 4 FC layers with hidden dimension , using ReLU nonlinearity for the decoder and Leaky ReLU with factor for the encoder, and all the layers are followed by batch normalization except the last ones. The $R^2$ values are reported in the last two rows of Table 1. It can be seen that both models show strong linear relations. This is nontrivial and surprising, because when presented with only synthesized face stimuli, the VAE training of the highly nonlinear generator network [Montufar et al.2014] can automatically learn a code that is linearly related to the underlying AAM code that generates the given face stimuli. That is, the learned generator network behaves similarly to the face patch systems ML/MF and AM in the primate brain.
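Table 1: $R^2$ of linear models A and B for different dimensionalities $d$ of $z$.

| dimension d for $z$ | d=20 | d=100 | d=200 |
|---|---|---|---|
| $R^2$ (A) (Conv) | 0.9602 | 0.9624 | 0.9631 |
| $R^2$ (B) (Conv) | 0.9644 | 0.9807 | 0.9889 |
| $R^2$ (A) (FC) | 0.9585 | 0.9588 | 0.9594 |
| $R^2$ (B) (FC) | 0.9410 | 0.9649 | 0.9709 |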
5.3 Decoding
As argued in [Chang and Tsao2017], we should be able to linearly decode the facial features from the neural responses if there is a linear relationship between them. If so, we can accurately predict what the primate brain sees by knowing only the neural responses of the brain. Knowing from the above experiment that the learned code of the generator network shows a strong linear relationship with the facial features, we expect that our automatically learned code $\hat{z}$ can accurately predict the facial features $b$, which can then be used to reconstruct the input face image via the AAM. Therefore we further examine the decoding quality in this section.
To proceed, for training, we use $b$ and $\hat{z}$ obtained during the learning process to fit model B as described above. Denote the estimated coefficients as $\hat{B}$. To test the decoding quality, we carry out the following two steps: (1) Randomly sample a new set of AAM-generated face images, which are used as the testing set. Then use the trained encoder network, i.e., the mean network, of the VAE to get the point estimate of the latent code of the generator network, i.e., $\hat{z}_{\rm test}$. (2) Use the estimated $\hat{B}$ to get the predicted AAM code:

$$\hat{b}_{\rm test} = \hat{B}\, \hat{z}_{\rm test}. \tag{11}$$

The predicted AAM code is then projected onto the previously learned AAM eigenvectors $P_s$ and $P_g$ to get the reconstructed image.
Figure 4 shows some testing images and the reconstructed ones. It can be seen that the linear model between the code learned by the VAE and the AAM code gives us high decoding quality. In this experiment as well as the subsequent experiments, we set the dimensionality of $z$ to be . Other dimensionalities give similar results.
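The two-step decoding procedure can be sketched as fitting model B on training pairs and applying it to held-out codes. The sketch below is only an illustration under stated assumptions: `fake_encoder_code` is a hypothetical noisy linear stand-in for the trained mean network, the code dimensions (20 and 100) are chosen for the toy example, and the final step of rendering images through the AAM is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_aam, d_z = 400, 100, 20, 100

# Hypothetical stand-in for the trained encoder (assumption): a fixed
# noisy linear map from the AAM code to the learned code.
M = rng.normal(size=(d_aam, d_z))

def fake_encoder_code(b):
    return b @ M + 0.05 * rng.normal(size=(b.shape[0], d_z))

# Training pairs: AAM codes and the corresponding learned codes.
b_train = rng.normal(size=(n_train, d_aam))
z_train = fake_encoder_code(b_train)

# Fit model B (learned code -> AAM code) with an intercept term.
Z1_train = np.hstack([z_train, np.ones((n_train, 1))])
B_hat, *_ = np.linalg.lstsq(Z1_train, b_train, rcond=None)

# Step (1): new test faces and their learned codes.
b_test = rng.normal(size=(n_test, d_aam))
z_test = fake_encoder_code(b_test)

# Step (2): predicted AAM codes for the test set, as in Eqn (11).
Z1_test = np.hstack([z_test, np.ones((n_test, 1))])
b_pred = Z1_test @ B_hat

# Held-out mean squared error between true and predicted AAM codes.
test_mse = np.mean((b_test - b_pred) ** 2)
```

Evaluating on freshly sampled codes rather than on the training pairs is what distinguishes decoding quality from the in-sample goodness of fit.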
5.4 Shape/Appearance Separation
The latent code $\hat{z}$ learned by the deep generative model mixes shape and appearance information. It would be useful to separate the shape and appearance parts of $\hat{z}$. In this experiment, we further identify the strengths of the shape and appearance parts of the learned code $\hat{z}$.
The first two experiments show that $\hat{z}$ is linearly related to the AAM code $b$, which contains both the shape and appearance parts. We can identify these two parts by projecting each dimension of $\hat{z}$ onto the 10-dimensional shape code $b_s$ and the 10-dimensional appearance code $b_g$. We can then obtain the relative $R^2$ for each part. A higher $R^2$ for one part indicates a stronger response to this part. Recall that $b = (b_s, b_g)$. We fit the linear models on the shape part and the appearance part respectively:

$$\hat{z}_j = w_{s,j}^\top b_s + \epsilon_{s,j}, \tag{12}$$

$$\hat{z}_j = w_{g,j}^\top b_g + \epsilon_{g,j}, \tag{13}$$

where $\hat{z}_j$ denotes the $j$-th dimension of $\hat{z}$.
Figures 5 and 6 show the $R^2$ values for each dimension of $\hat{z}$. They show that each dimension of $\hat{z}$ responds differently to shape and appearance. To further verify and visualize our analysis, we first choose the three dimensions with the top $R^2$ for shape and the three dimensions with the top $R^2$ for appearance. Then we visualize the images generated by the trained generator network by varying ( sd) the three chosen dimensions of the learned code while keeping the other three dimensions fixed. Figure 7 shows the result. It is clear that if we only vary the shape dimensions of the code (horizontally in the figure), the generated images mainly change their shapes while the appearances tend to remain similar. On the other hand, if we only vary the appearance dimensions of the code (vertically in the figure), the generated images mainly change their appearances instead of shapes.
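The per-dimension analysis amounts to regressing each dimension of the learned code separately on the shape code and on the appearance code, and comparing the two $R^2$ values. A minimal sketch follows, with a hypothetical helper `per_dimension_r2` and a simulated three-dimensional code whose first dimension depends only on shape, whose second depends only on appearance, and whose third mixes both (all assumptions for illustration).

```python
import numpy as np

def per_dimension_r2(X, Z):
    """R^2 of regressing each column of Z on X (with intercept), one value per column."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X1, Z, rcond=None)
    ss_res = np.sum((Z - X1 @ coef) ** 2, axis=0)
    ss_tot = np.sum((Z - Z.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 1000
b_s = rng.normal(size=(n, 10))   # simulated shape codes
b_g = rng.normal(size=(n, 10))   # simulated appearance codes

# Simulated learned code: shape-only, appearance-only, and mixed dimensions.
z = np.column_stack([
    b_s @ rng.normal(size=10),
    b_g @ rng.normal(size=10),
    b_s @ rng.normal(size=10) + b_g @ rng.normal(size=10),
])

r2_shape = per_dimension_r2(b_s, z)
r2_app = per_dimension_r2(b_g, z)

# Dimensions ranked by how strongly they respond to shape or appearance.
top_shape = np.argsort(r2_shape)[::-1]
top_app = np.argsort(r2_app)[::-1]
```

Ranking the dimensions by these two scores gives the candidates for the shape-dominated and appearance-dominated dimensions that are then varied in the visualization.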
5.5 Replicating AAM by Supervised Learning
So far the generator network learns from the AAM in an unsupervised manner, where the generator network only has access to the training images but not the latent code of the AAM. We now examine whether the generator network has enough expressive power to replicate the AAM in the supervised setting, where we also provide the latent code of the AAM to the generator.
In this experiment, the synthesized face images and their AAM codes are given, and we use these pairs to learn the generator network. To be more specific, we first train the generator network using the images and their codes. Let us denote the trained generator as $f_{\hat{\theta}}$. Then, we prepare a new set of synthesized images $Y_{\rm test}$ and their AAM codes $b_{\rm test}$ as our testing set. $b_{\rm test}$ is then fed into $f_{\hat{\theta}}$ to get $\hat{Y}_{\rm test}$. If the generator network is capable of replicating the AAM, then $\hat{Y}_{\rm test}$ should be close to $Y_{\rm test}$, that is, the images generated by the trained generator network should be similar to the testing face images.
We use the same generator network structure as in the VAE training. We use the stochastic gradient descent (SGD) algorithm with momentum
to train the generator network for supervised learning. The learning rate is
with epochs. Figure 8 shows the ground-truth testing images generated by the AAM and the reconstructed images generated by the trained generator network. We also calculate the per-pixel reconstruction error, which is 0.0113.
6 Conclusion
The recent work in neuroscience [Chang and Tsao2017] shows that face images can be reconstructed using the cell responses from face patches ML/MF and AM. To investigate whether the widely used generator network has a similar property, we design and conduct experiments to examine the relationship between the AAM code that generates the face stimuli and the code automatically learned by the generator network. Through the linearity analysis and the decoding quality analysis, we find that the biological observations made in [Chang and Tsao2017] can be qualitatively reproduced by the generator network, i.e., the learned code shows a strong linear relationship with the AAM code. Additionally, we can use this relationship to further separate the shape and appearance parts of the learned code. Again, this is similar to the neural system, as it is found that ML/MF and AM carry complementary information about shape and appearance. Furthermore, we show through supervised learning that the generator network is capable of replicating the AAM.
In this paper, we distill the knowledge of a pretrained AAM to the generator network. It will also be interesting to distill the knowledge of a learned generator network to an AAM in order to interpret the generator network. We leave it to future work.
Acknowledgments
The work is supported by DARPA SIMPLEX N6600115C4035, ONR MURI N000141612007, DARPA ARO W911NF1610579, and DARPA N660011724029. Part of the work was done while the first author was visiting Microsoft Research in Seattle. We thank Dr. Gang Hua for his help.
References
 [Blei et al.2017] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, (just-accepted), 2017.
 [Chang and Tsao2017] Le Chang and Doris Y Tsao. The code for facial identity in the primate brain. Cell, 169(6):1013–1028, 2017.
 [Chen et al.2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
 [Cootes et al.2001] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.
 [Cootes et al.2015] TF Cootes, MG Roberts, KO Babalola, and CJ Taylor. Active shape and appearance models. In Handbook of Biomedical Imaging, pages 105–122. Springer, 2015.
 [Dempster et al.1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: B, pages 1–38, 1977.
 [Denton et al.2015] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
 [Dinh et al.2016] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. CoRR, abs/1605.08803, 2016.
 [Dosovitskiy et al.2015] E Dosovitskiy, J. T. Springenberg, and T Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
 [Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [Han et al.2017] Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In AAAI, 2017.
 [Higgins et al.2016] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.
 [Hinton et al.1995] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
 [Ioffe and Szegedy2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [Kingma and Ba2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [Kingma and Welling2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [Kingma and Welling2014] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
 [Li et al.2015] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pages 1718–1727, 2015.
 [Montufar et al.2014] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.
 [Oord et al.2016] Aaron Van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of The 33rd International Conference on Machine Learning, pages 1747–1756, 2016.
 [Radford et al.2016] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
 [Rezende et al.2014] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Tony Jebara and Eric P. Xing, editors, ICML, pages 1278–1286. JMLR Workshop and Conference Proceedings, 2014.
 [Salimans et al.2015] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In Proceedings of The 32nd International Conference on Machine Learning, pages 1218–1226, 2015.