1 Introduction
This paper studies the fundamental problem of learning and inference in the generator network (Goodfellow et al., 2014), which is a generative model that has become popular recently. Specifically, we propose an alternating backpropagation algorithm for learning and inference in this model.
1.1 Nonlinear factor analysis
The generator network is a nonlinear generalization of factor analysis. Factor analysis is a prototype model in unsupervised learning of distributed representations. There are two directions one can pursue in order to generalize the factor analysis model. One direction is to generalize the prior model or the prior assumption about the latent factors. This led to methods such as independent component analysis (Hyvärinen, Karhunen, and Oja, 2004), sparse coding (Olshausen and Field, 1997), nonnegative matrix factorization (Lee and Seung, 2001), and matrix factorization and completion for recommender systems (Koren, Bell, and Volinsky, 2009).
The other direction to generalize the factor analysis model is to generalize the mapping from the continuous latent factors to the observed signal. The generator network is an example in this direction. It generalizes the linear mapping in factor analysis to a nonlinear mapping that is defined by a convolutional neural network (ConvNet or CNN) (LeCun et al., 1998; Krizhevsky, Sutskever, and Hinton, 2012; Dosovitskiy, Springenberg, and Brox, 2015). It has been shown recently that the generator network is capable of generating realistic images (Denton et al., 2015; Radford, Metz, and Chintala, 2016).
The generator network is a fundamental representation of knowledge, and it has the following properties: (1) Analysis: The model disentangles the variations in the observed signals into independent variations of latent factors. (2) Synthesis: The model can synthesize new signals by sampling the factors from the known prior distribution and transforming the factors into the signal. (3) Embedding: The model embeds the high-dimensional non-Euclidean manifold formed by the observed signals into the low-dimensional Euclidean space of the latent factors, so that linear interpolation in the low-dimensional factor space results in nonlinear interpolation in the data space.
1.2 Alternating backpropagation
The factor analysis model can be learned by the Rubin-Thayer EM algorithm (Rubin and Thayer, 1982; Dempster, Laird, and Rubin, 1977), where both the E-step and the M-step are based on multivariate linear regression. Inspired by this algorithm, we propose an alternating backpropagation algorithm for learning the generator network that iterates the following two steps:
(1) Inferential backpropagation: For each training example, infer the continuous latent factors by Langevin dynamics or gradient descent.
(2) Learning backpropagation: Update the parameters given the inferred latent factors by gradient descent.
The Langevin dynamics (Neal, 2011) is a stochastic sampling counterpart of gradient descent. The gradient computations in both steps are powered by backpropagation. Because of the ConvNet structure, the gradient computation in step (1) is actually a byproduct of the gradient computation in step (2) in terms of coding.
Given the factors, the learning of the ConvNet is a supervised learning problem (Dosovitskiy, Springenberg, and Brox, 2015) that can be accomplished by the learning backpropagation. With factors unknown, the learning becomes an unsupervised problem, which can be solved by adding the inferential backpropagation as an inner loop of the learning process. We shall show that the alternating backpropagation algorithm can learn realistic generator models of natural images, video sequences, and sounds.
The alternating backpropagation algorithm follows the tradition of alternating operations in unsupervised learning, such as alternating linear regression in the EM algorithm for factor analysis, the alternating least squares algorithm for matrix factorization (Koren, Bell, and Volinsky, 2009; Kim and Park, 2008), and the alternating gradient descent algorithm for sparse coding (Olshausen and Field, 1997). All these unsupervised learning algorithms alternate an inference step and a learning step, as is the case with alternating backpropagation.
1.3 Explaining-away inference
The inferential backpropagation solves an inverse problem by an explaining-away process, where the latent factors compete with each other to explain each training example. The following are the advantages of the explaining-away inference of the latent factors:
(1) The latent factors may follow sophisticated prior models. For instance, in textured motions (Wang and Zhu, 2003) or dynamic textures (Doretto et al., 2003), the latent factors may follow a dynamic model such as vector autoregression. By inferring the latent factors that explain the observed examples, we can learn the prior model.
(2) The observed data may be incomplete or indirect. For instance, the training images may contain occluded objects. In this case, the latent factors can still be obtained by explaining the incomplete or indirect observations, and the model can still be learned as before.
1.4 Learning from incomplete or indirect data
We venture to propose that a main advantage of a generative model is to learn from incomplete or indirect data, which are not uncommon in practice. The generative model can then be evaluated based on how well it recovers the unobserved original data, while still learning a model that can generate new data. Learning the generator network from incomplete data can be considered a nonlinear generalization of matrix completion.
We also propose to evaluate the learned generator network by the reconstruction error on the testing data.
1.5 Contribution and related work
The main contribution of this paper is to propose the alternating backpropagation algorithm for training the generator network. Another contribution is to evaluate the generative models by learning from incomplete or indirect training data.
Existing training methods for the generator network avoid explaining-away inference of latent factors. Two methods have recently been devised to accomplish this. Both methods involve an assisting network with a separate set of parameters in addition to the original network that generates the signals. One method is the variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende, Mohamed, and Wierstra, 2014; Mnih and Gregor, 2014), where the assisting network is an inferential or recognition network that seeks to approximate the posterior distribution of the latent factors. The other method is the generative adversarial network (GAN) (Goodfellow et al., 2014; Denton et al., 2015; Radford, Metz, and Chintala, 2016), where the assisting network is a discriminator network that plays an adversarial role against the generator network.
Unlike alternating backpropagation, VAE does not perform explicit explaining-away inference, while GAN avoids inferring the latent factors altogether. In comparison, the alternating backpropagation algorithm is simpler and more basic, without resorting to an extra network. While it is difficult to compare these methods directly, we illustrate the strength of alternating backpropagation by learning from incomplete and indirect data, where we only need to explain whatever data we are given. This may prove difficult or less convenient for VAE and GAN.
Meanwhile, alternating backpropagation is complementary to VAE and GAN training. It may use VAE to initialize the inferential backpropagation, and as a result, may improve the inference in VAE. The inferential backpropagation may help infer the latent factors of the observed examples for GAN, thus providing a method to test if GAN can explain the entire training set.
2 Factor analysis with ConvNet
2.1 Factor analysis and beyond
Let $Y$ be a $D$-dimensional observed data vector, such as an image. Let $Z$ be the $d$-dimensional vector of continuous latent factors, $Z = (z_k, k = 1, ..., d)$. The traditional factor analysis model is $Y = WZ + \epsilon$, where $W$ is a $D \times d$ matrix, and $\epsilon$ is a $D$-dimensional error vector or the observational noise. We assume that $Z \sim \mathrm{N}(0, I_d)$, where $I_d$ stands for the $d$-dimensional identity matrix. We also assume that $\epsilon \sim \mathrm{N}(0, \sigma^2 I_D)$, i.e., the observational errors are Gaussian white noises. There are three perspectives to view $W$. (1) Basis vectors. Write $W = (W_1, ..., W_d)$, where each $W_k$ is a $D$-dimensional column vector. Then $Y = \sum_{k=1}^{d} z_k W_k + \epsilon$, i.e., $W_k$ are the basis vectors and $z_k$ are the coefficients. (2) Loading matrix. Write $W = (w_1, ..., w_D)^\top$, where $w_j^\top$ is the $j$-th row of $W$. Then $y_j = \langle w_j, Z \rangle + \epsilon_j$, where $y_j$ and $\epsilon_j$ are the $j$-th components of $Y$ and $\epsilon$ respectively. Each $y_j$ is a loading of the $d$ factors where $w_j$ is a vector of loading weights, indicating which factors are important for determining $y_j$. $W$ is called the loading matrix. (3) Matrix factorization. Suppose we observe $Y = (Y_1, ..., Y_n)$, whose factors are $Z = (Z_1, ..., Z_n)$, then $Y \approx WZ$.
The factor analysis model can be learned by the Rubin-Thayer EM algorithm, which involves alternating regressions of $Z$ on $Y$ in the E-step and of $Y$ on $Z$ in the M-step, with both steps powered by the sweep operator (Rubin and Thayer, 1982; Liu, Rubin, and Wu, 1998).
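The three views of $W$ can be checked numerically. The following NumPy sketch is illustrative only: the sizes `D`, `d`, `n`, the noise level `sigma`, and the random seed are arbitrary choices made here, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, sigma = 6, 2, 0.1            # illustrative sizes and noise level (assumed)

W = rng.standard_normal((D, d))    # D x d matrix W
Z = rng.standard_normal(d)         # latent factors, Z ~ N(0, I_d)
eps = sigma * rng.standard_normal(D)

# Matrix form of the model: Y = W Z + eps
Y = W @ Z + eps

# (1) Basis vectors: Y is a superposition of the columns W_k with coefficients z_k
Y_basis = sum(Z[k] * W[:, k] for k in range(d)) + eps

# (2) Loading matrix: y_j is the inner product of the j-th row w_j with Z, plus eps_j
Y_loading = np.array([W[j, :] @ Z + eps[j] for j in range(D)])

# (3) Matrix factorization: n examples stacked as columns, noise-free part W Z
n = 5
Z_all = rng.standard_normal((d, n))
Y_all = W @ Z_all                  # a D x n matrix of rank at most d
```

All three computations give the same $Y$, and the stacked noise-free data matrix has rank $d$, which is the sense in which factor analysis is a low-rank matrix factorization.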
The factor analysis model is the prototype of many subsequent models that generalize the prior model of $Z$. (1) Independent component analysis (Hyvärinen, Karhunen, and Oja, 2004): $d = D$, $\epsilon = 0$, and $z_k$ are assumed to follow independent heavy-tailed distributions. (2) Sparse coding (Olshausen and Field, 1997): $d > D$, and $Z$ is assumed to be a redundant but sparse vector, i.e., only a small number of $z_k$ are nonzero or significantly different from zero. (3) Nonnegative matrix factorization (Lee and Seung, 2001): it is assumed that $z_k \geq 0$. (4) Recommender system (Koren, Bell, and Volinsky, 2009): $Z$ is a vector of a customer's desires in different aspects, and $w_j$ is a vector of product $j$'s desirabilities in these aspects.
2.2 ConvNet mapping
In addition to generalizing the prior model of the latent factors $Z$, we can also generalize the mapping from $Z$ to $Y$. In this paper, we consider the generator network model (Goodfellow et al., 2014) that retains the assumptions that $d < D$, $Z \sim \mathrm{N}(0, I_d)$, and $\epsilon \sim \mathrm{N}(0, \sigma^2 I_D)$ as in traditional factor analysis, but generalizes the linear mapping $WZ$ to a nonlinear mapping $f(Z; W)$, where $f$ is a ConvNet, and $W$ collects all the connection weights and bias terms of the ConvNet. Then the model becomes
$$Y = f(Z; W) + \epsilon, \quad Z \sim \mathrm{N}(0, I_d), \quad \epsilon \sim \mathrm{N}(0, \sigma^2 I_D), \quad d < D. \tag{1}$$
The reconstruction error is $\|Y - f(Z; W)\|^2$. We may assume more sophisticated models for $\epsilon$, such as colored noise or non-Gaussian texture. If $Y$ is binary, we can emit $Y$ by a probability map $P = 1/[1 + \exp(-f(Z; W))]$, where the sigmoid transformation and Bernoulli sampling are carried out pixel-wise. If $Y$ is multi-level, we may assume a multinomial logistic emission model or some ordinal emission model.
Although $f(Z; W)$ can be any nonlinear mapping, the ConvNet parameterization of $f(Z; W)$ makes it particularly close to the original factor analysis. Specifically, we can write the top-down ConvNet as follows:
$$Z^{(l-1)} = f_l(W_l Z^{(l)} + b_l), \tag{2}$$
where $f_l$ is the element-wise nonlinearity at layer $l$, $W_l$ is the matrix of connection weights, $b_l$ is the vector of bias terms at layer $l$, and $W = (W_l, b_l, l = 1, ..., L)$. $Z^{(0)} = f(Z; W)$, and $Z^{(L)} = Z$. The top-down ConvNet (2) can be considered a recursion of the original factor analysis model, where the factors at the layer $l - 1$ are obtained by the linear superposition of the basis vectors or basis functions that are column vectors of $W_l$, with the factors at the layer $l$ serving as the coefficients of the linear superposition. In the case of ConvNet, the basis functions are shift-invariant versions of one another, like wavelets. See Appendix for an in-depth understanding of the model.
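The top-down recursion from the latent factors at the top layer to the signal at the bottom layer can be sketched in a few lines. This is a simplified illustration, not the network used in the experiments: it uses dense matrices in place of deconvolution layers, and the helper `top_down`, the layer sizes, and the choice of nonlinearities are assumptions made here.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def top_down(Z, weights, biases, nonlinearities):
    """Apply Z^(l-1) = f_l(W_l Z^(l) + b_l) for l = L, ..., 1 and return Z^(0).

    The three lists are ordered l = 1, ..., L, so they are traversed in reverse.
    """
    h = Z                                              # Z^(L) = Z
    for W_l, b_l, f_l in zip(reversed(weights), reversed(biases), reversed(nonlinearities)):
        h = f_l(W_l @ h + b_l)
    return h                                           # Z^(0) = f(Z; W)

# Toy instance with L = 2 (sizes are arbitrary): d = 2 -> 4 -> D = 8
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 4)), rng.standard_normal((4, 2))]   # W_1, W_2
biases = [np.zeros(8), np.zeros(4)]                                     # b_1, b_2
nonlinearities = [np.tanh, relu]                                        # f_1, f_2
Z = rng.standard_normal(2)
Y_hat = top_down(Z, weights, biases, nonlinearities)
```

Each layer is a factor analysis step followed by a nonlinearity: the columns of `W_l` act as basis vectors and the current activations act as their coefficients.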
3 Alternating backpropagation
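If we observe a training set of data vectors $\{Y_i, i = 1, ..., n\}$, then each $Y_i$ has a corresponding $Z_i$, but all the $Y_i$ share the same ConvNet $W$. Intuitively, we should infer $\{Z_i\}$ and learn $W$ to minimize the reconstruction error $\sum_{i=1}^{n} \|Y_i - f(Z_i; W)\|^2$ plus a regularization term that corresponds to the prior on $Z$.
More formally, the model can be written as $Z \sim p(Z)$ and $[Y | Z, W] \sim p(Y | Z, W)$. Adopting the language of the EM algorithm (Dempster, Laird, and Rubin, 1977), the complete-data model is given by
$$\log p(Y, Z; W) = \log [p(Z) p(Y | Z, W)] = -\frac{1}{2\sigma^2} \|Y - f(Z; W)\|^2 - \frac{1}{2} \|Z\|^2 + \mathrm{constant}. \tag{3}$$
The observed-data model is obtained by integrating out $Z$: $p(Y; W) = \int p(Z) p(Y | Z, W) dZ$. The posterior distribution of $Z$ is given by $p(Z | Y, W) = p(Y, Z; W)/p(Y; W) \propto p(Z) p(Y | Z, W)$ as a function of $Z$.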
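For the training data $\{Y_i\}$, the complete-data log-likelihood is $L(W, \{Z_i\}) = \sum_{i=1}^{n} \log p(Y_i, Z_i; W)$, where we assume $\sigma^2$ is given. Learning and inference can be accomplished by maximizing the complete-data log-likelihood, which can be obtained by the alternating gradient descent algorithm that iterates the following two steps: (1) Inference step: update $\{Z_i\}$ by running a fixed number of steps of gradient descent. (2) Learning step: update $W$ by one step of gradient descent.
A more rigorous method is to maximize the observed-data log-likelihood, which is $L(W) = \sum_{i=1}^{n} \log p(Y_i; W) = \sum_{i=1}^{n} \log \int p(Y_i, Z_i; W) dZ_i$. The observed-data log-likelihood takes into account the uncertainties in inferring $Z_i$. See Appendix for an in-depth understanding.
The gradient of $L(W)$ can be calculated according to the following well-known fact that underlies the EM algorithm:
$$\frac{\partial}{\partial W} \log p(Y; W) = \frac{1}{p(Y; W)} \frac{\partial}{\partial W} \int p(Y, Z; W) dZ = \mathrm{E}_{p(Z | Y, W)} \left[ \frac{\partial}{\partial W} \log p(Y, Z; W) \right]. \tag{4}$$
The expectation with respect to $p(Z | Y, W)$ can be approximated by drawing samples from $p(Z | Y, W)$ and then computing the Monte Carlo average.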
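The Langevin dynamics for sampling $Z \sim p(Z | Y, W)$ iterates
$$Z_{\tau + 1} = Z_\tau + s U_\tau + \frac{s^2}{2} \left[ \frac{1}{\sigma^2} (Y - f(Z_\tau; W)) \frac{\partial}{\partial Z} f(Z_\tau; W) - Z_\tau \right], \tag{5}$$
where $\tau$ denotes the time step for the Langevin sampling, $s$ is the step size, and $U_\tau$ denotes a random vector that follows $\mathrm{N}(0, I_d)$. The Langevin dynamics (5) is an explaining-away process, where the latent factors in $Z$ compete to explain away the current residual $Y - f(Z_\tau; W)$.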
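A single Langevin update can be written out for the special case of a linear generator $f(Z; W) = WZ$, where the derivative of $f$ with respect to $Z$ is simply $W$ and the posterior mode is known in closed form. The sketch below is an illustration under that assumption; the function names, the toy matrix, and the step size are choices made here and are not taken from the experiments.

```python
import numpy as np

def log_posterior(Z, Y, W, sigma):
    """log p(Y, Z; W) up to a constant, for the linear generator f(Z; W) = W Z."""
    r = Y - W @ Z
    return -0.5 * (r @ r) / sigma**2 - 0.5 * (Z @ Z)

def langevin_step(Z, Y, W, sigma, s, rng=None):
    """One Langevin update of Z. With rng=None the noise term is dropped,
    which leaves a plain gradient ascent step on the log posterior."""
    grad = W.T @ (Y - W @ Z) / sigma**2 - Z          # residual mapped back to Z, minus prior pull
    noise = rng.standard_normal(Z.shape) if rng is not None else np.zeros_like(Z)
    return Z + s * noise + 0.5 * s**2 * grad

# Toy problem (values are arbitrary)
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0])
sigma, s = 1.0, 0.1

Z = np.zeros(2)
for _ in range(3000):
    Z = langevin_step(Z, Y, W, sigma, s)             # noise-free steps approach the posterior mode
Z_mode = np.linalg.solve(W.T @ W + sigma**2 * np.eye(2), W.T @ Y)

Z_sample = langevin_step(Z, Y, W, sigma, s, rng=np.random.default_rng(0))  # one noisy step
```

With the noise term included, repeated steps produce approximate samples around the mode rather than converging to it.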
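To explain Langevin dynamics, its continuous time version for sampling $\pi(x) \propto \exp[-\mathcal{E}(x)]$ is $x_{t + \Delta t} = x_t - \Delta t \, \mathcal{E}'(x_t)/2 + \sqrt{\Delta t} \, U_t$. The dynamics has $\pi$ as its stationary distribution, because it can be shown that for any well-behaved testing function $h$, if $x_t \sim \pi$, then $[\mathrm{E}[h(x_{t + \Delta t})] - \mathrm{E}[h(x_t)]]/\Delta t \rightarrow 0$, as $\Delta t \rightarrow 0$, so that $x_{t + \Delta t} \sim \pi$. Alternatively, given $x_t = x$, suppose $x_{t + \Delta t} \sim K(x, y)$, then $[\pi(y) K(y, x)]/[\pi(x) K(x, y)] \rightarrow 1$ as $\Delta t \rightarrow 0$.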
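The stochastic gradient algorithm of (Younes, 1999) can be used for learning, where in each iteration, for each $Z_i$, only a single copy of $Z_i$ is sampled from $p(Z_i | Y_i, W)$ by running a finite number of steps of Langevin dynamics starting from the current value of $Z_i$, i.e., the warm start. With $\{Z_i\}$ sampled in this manner, we can update the parameter $W$ based on the gradient $L'(W)$, whose Monte Carlo approximation is:
$$L'(W) \approx \sum_{i=1}^{n} \frac{\partial}{\partial W} \log p(Y_i, Z_i; W) = -\sum_{i=1}^{n} \frac{\partial}{\partial W} \frac{1}{2\sigma^2} \|Y_i - f(Z_i; W)\|^2 = \sum_{i=1}^{n} \frac{1}{\sigma^2} (Y_i - f(Z_i; W)) \frac{\partial}{\partial W} f(Z_i; W). \tag{6}$$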
Algorithm 1 describes the details of the learning and sampling algorithm.
If the Gaussian noise in the Langevin dynamics (5) is removed, then the above algorithm becomes the alternating gradient descent algorithm. It is possible to update both $W$ and $\{Z_i\}$ simultaneously by joint gradient descent.
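The overall loop can be sketched for a small fully connected generator $f(Z; W) = W_1 \tanh(W_2 Z)$ with hand-written gradients. This is a toy stand-in for the ConvNet, not the implementation used in the experiments: the function `abp_sketch`, the network shape, and all default values (`sigma`, `s`, `n_langevin`, `n_iter`, `lr`) are conservative assumptions chosen here for numerical stability. Setting `noise=False` removes the Gaussian term and gives the alternating gradient descent variant.

```python
import numpy as np

def abp_sketch(Y, d, hidden=16, sigma=0.5, s=0.05, n_langevin=10,
               n_iter=100, lr=1e-4, noise=True, seed=0):
    """Alternating back-propagation for the toy generator f(Z; W) = W1 tanh(W2 Z)."""
    rng = np.random.default_rng(seed)
    D, n = Y.shape                                # columns of Y are the examples Y_i
    W1 = 0.1 * rng.standard_normal((D, hidden))
    W2 = 0.1 * rng.standard_normal((hidden, d))
    Z = rng.standard_normal((d, n))               # column i is Z_i, kept across iterations (warm start)

    def backprop(Z):
        H = np.tanh(W2 @ Z)                       # hidden layer
        R = Y - W1 @ H                            # residual Y - f(Z; W)
        G = (1.0 - H**2) * (W1.T @ R)             # residual back-propagated to the hidden layer
        return H, R, G

    def objective(Z):
        _, R, _ = backprop(Z)                     # negative complete-data log-likelihood, up to a constant
        return 0.5 * np.sum(R**2) / sigma**2 + 0.5 * np.sum(Z**2)

    history = [objective(Z)]
    for _ in range(n_iter):
        # (1) Inferential back-propagation: Langevin steps on every Z_i.
        for _ in range(n_langevin):
            _, _, G = backprop(Z)
            grad_Z = W2.T @ G / sigma**2 - Z
            U = rng.standard_normal(Z.shape) if noise else 0.0
            Z = Z + s * U + 0.5 * s**2 * grad_Z
        # (2) Learning back-propagation: one gradient step on W = (W1, W2).
        H, R, G = backprop(Z)
        W1 += lr * (R @ H.T) / sigma**2
        W2 += lr * (G @ Z.T) / sigma**2           # reuses G, the same quantity used for grad_Z
        history.append(objective(Z))
    return (W1, W2), Z, history

# Toy data in [-1, 1] (arbitrary): D = 8, n = 40, two underlying factors
rng = np.random.default_rng(1)
Y = np.tanh(rng.standard_normal((8, 2)) @ rng.standard_normal((2, 40)))
W, Z, history = abp_sketch(Y, d=2)                          # Langevin version
W_gd, Z_gd, history_gd = abp_sketch(Y, d=2, noise=False)    # alternating gradient descent
```

Note that the array `G` is computed once and used for both the gradient with respect to $Z$ and the gradient with respect to $W_2$, which is a small-scale instance of the two back-propagation passes sharing their chain rule computation.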
Both the inferential backpropagation and the learning backpropagation are guided by the residual $Y - f(Z; W)$. The inferential backpropagation is based on $\partial f(Z; W)/\partial Z$, whereas the learning backpropagation is based on $\partial f(Z; W)/\partial W$. Both gradients can be efficiently computed by backpropagation. The computations of the two gradients share most of their steps. Specifically, for the top-down ConvNet defined by (2), $\partial f(Z; W)/\partial W$ and $\partial f(Z; W)/\partial Z$ share the same code for the chain rule computation of $\partial Z^{(l-1)}/\partial Z^{(l)}$ for $l = 1, ..., L$. Thus, the code for $\partial f(Z; W)/\partial Z$ is part of the code for $\partial f(Z; W)/\partial W$.
In Algorithm 1, the Langevin dynamics samples from a gradually changing posterior distribution $p(Z_i | Y_i, W)$ because $W$ keeps changing. The updating of both $W$ and $\{Z_i\}$ collaborate to reduce the reconstruction error $\|Y_i - f(Z_i; W)\|^2$. The parameter $\sigma^2$ plays the role of annealing or tempering in Langevin sampling. If $\sigma^2$ is very large, then the posterior is close to the prior $\mathrm{N}(0, I_d)$. If $\sigma^2$ is very small, then the posterior may be multi-modal, but the evolving energy landscape of $p(Z | Y, W)$ may help alleviate the trapping of the local modes. In practice, we tune the value of $\sigma^2$ instead of estimating it. The Langevin dynamics can be extended to Hamiltonian Monte Carlo (Neal, 2011) or more sophisticated versions (Girolami and Calderhead, 2011).
4 Experiments
The code in our experiments is based on the MatConvNet package of (Vedaldi and Lenc, 2015).
The training images and sounds are scaled so that the intensities are within the range $[-1, 1]$. We adopt the structure of the generator network of (Radford, Metz, and Chintala, 2016; Dosovitskiy, Springenberg, and Brox, 2015), where the top-down network consists of multiple layers of deconvolution by linear superposition, ReLU nonlinearity, and upsampling, with tanh nonlinearity at the bottom layer (Radford, Metz, and Chintala, 2016) to make the signals fall within $[-1, 1]$. We also adopt batch normalization (Ioffe and Szegedy, 2015).
We fix the standard deviation $\sigma$ of the noise vector $\epsilon$. Within each learning iteration, the number of Langevin steps takes one of two settings (one of them 30), and the Langevin step size $s$ takes one of two settings (one of them .1). We run a fixed number of learning iterations, with learning rate .0001, and momentum .5. The learning algorithm produces the learned network parameters $W$ and the inferred latent factors $Z$ for each signal $Y$ in the end. The synthesized signals are obtained by $f(Z; W)$, where $Z$ is sampled from the prior distribution $p(Z)$.
4.1 Qualitative experiments
Experiment 1. Modeling texture patterns. We learn a separate model from each texture image. The images are collected from the Internet, and then resized to $224 \times 224$. The synthesized images are $448 \times 448$. Figure 1 shows four examples.
The factors $Z$ at the top layer form a $\sqrt{d} \times \sqrt{d}$ image, with each pixel following $\mathrm{N}(0, 1)$ independently. The $\sqrt{d} \times \sqrt{d}$ image $Z$ is then transformed to $Y$ by the top-down ConvNet. We use $d = 7^2$ in the learning stage for all the texture experiments. In order to obtain the synthesized image, we randomly sample a $14 \times 14$ $Z$ from $\mathrm{N}(0, I)$, and then expand the learned network $W$ to generate the $448 \times 448$ synthesized image $f(Z; W)$.
The training network is as follows. Starting from the $7 \times 7$ image $Z$, the network has 5 layers of deconvolution (i.e., linear superposition of basis functions), with an upsampling factor of 2 at each layer (i.e., the basis functions are 2 pixels apart), so that $7 \times 2^5 = 224$. The number of channels in the first layer is 512 (i.e., 512 translation-invariant basis functions), and is decreased by a factor of 2 at each layer.
Experiment 2. Modeling sound patterns. A sound signal can be treated as a one-dimensional texture image (McDermott and Simoncelli, 2011). The sound data are collected from the Internet. Each training signal is a 5 second clip with the sampling rate of 11025 Hertz and is represented as a vector. We learn a separate model from each sound signal.
The latent factors $Z$ form a sequence that follows $\mathrm{N}(0, I_d)$. The top-down network consists of 4 layers of deconvolution, with an upsampling factor of 10. The number of channels in the first layer is 256, and decreases by a factor of 2 at each layer. For synthesis, we start from a longer Gaussian white noise sequence $Z$ and generate the synthesized sound by expanding the learned network. Figure 2 shows the waveforms of the observed sound signal in the first row and the synthesized sound signal in the second row.
Experiment 3. Modeling object patterns. We model object patterns using the network structure that is essentially the same as the network for the texture model, except that we include a fully connected layer under the latent factors $Z$, now a $d$-dimensional vector. We use ReLU with a leaking factor .2 (Maas, Hannun, and Ng, 2013; Xu et al., 2015).
In the first experiment, we learn a model where $Z$ has two components, i.e., $Z = (z_1, z_2)$, and $d = 2$. The training data are 11 images of 6 tigers and 5 lions. After training the model, we generate images using the learned top-down ConvNet for a grid of $(z_1, z_2)$ values, where we discretize both $z_1$ and $z_2$ into 9 equally spaced values. The left panel of Figure 3 displays the synthesized images on the $9 \times 9$ panel.
In the second experiment, we learn a model from 1000 face images randomly selected from the CelebA dataset (Liu et al., 2015). The left panel of Figure 4 displays the images generated by the learned model. The middle panel displays the interpolation results. The images at the four corners are generated by the $Z$ vectors of four images randomly selected from the training set. The images in the middle are obtained by first interpolating the $Z$'s of the four corner images using the sphere interpolation (Dinh, Sohl-Dickstein, and Bengio, 2016) and then generating the images by the learned ConvNet.
We also provide qualitative comparison with Deep Convolutional Generative Adversarial Net (DCGAN) (Goodfellow et al., 2014; Radford, Metz, and Chintala, 2016). The right panel of Figure 3 shows the generated results for the lion-tiger dataset. The right panel of Figure 4 displays the generated results trained on 1000 aligned faces from the CelebA dataset. We use the code from https://github.com/carpedm20/DCGAN-tensorflow, with the tuning parameters as in (Radford, Metz, and Chintala, 2016). We run the same number of iterations as in our method.
Experiment 4. Modeling dynamic patterns. We model a textured motion (Wang and Zhu, 2003) or a dynamic texture (Doretto et al., 2003) by a nonlinear dynamic system $Y_t = f(Z_t; W) + \epsilon_t$, and $Z_{t+1} = A Z_t + \eta_t$, where we assume the latent factors follow a vector autoregressive model, where $A$ is a $d \times d$ matrix, and $\eta_t \sim \mathrm{N}(0, Q)$ is the innovation. This model is a direct generalization of the linear dynamic system of (Doretto et al., 2003), where $Y_t$ is reduced to $Z_t$ by principal component analysis (PCA) via singular value decomposition (SVD). We learn the model in two steps. (1) Treat $\{Y_t\}$ as independent examples and learn $W$ and infer $\{Z_t\}$ as before. (2) Treat $\{Z_t\}$ as the training data, learn $A$ and $Q$ as in (Doretto et al., 2003). After that, we can synthesize a new dynamic texture. We start from an initial $Z_0$, and then generate the sequence according to the learned model (we discard a burn-in period of 15 frames). Figure 5 shows some experiments. The first row is a segment of the sequence generated by our model, and the second row is generated by the method of (Doretto et al., 2003), with the same dimensionality of $Z$. It is possible to generalize the autoregressive model of $Z_t$ to a recurrent network. We may also treat the video sequences as 3D images, and learn generator networks with 3D spatial-temporal filters or basis functions.
4.2 Quantitative experiments
Experiment 5. Learning from incomplete data. Our method can learn from images with occluded pixels. This task is inspired by the fact that most of the images contain occluded objects. It can be considered a nonlinear generalization of matrix completion in recommender systems.
Our method can be adapted to this task with minimal modification. The only modification involves the computation of $\|Y - f(Z; W)\|^2$. For a fully observed image, it is computed by summing over all the pixels. For a partially observed image, we compute it by summing over only the observed pixels. Then we can continue to use the alternating backpropagation algorithm to infer $Z$ and learn $W$. With inferred $Z$ and learned $W$, the image can be automatically recovered by $f(Z; W)$. In the end, we will be able to accomplish the following tasks: (T1) Recover the occluded pixels of training images. (T2) Synthesize new images from the learned model. (T3) Recover the occluded pixels of testing images using the learned model.
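The modification for partially observed images amounts to masking the residual before it enters either back-propagation pass. A minimal NumPy sketch follows; the function names and the boolean `mask` convention (True for observed pixels) are assumptions made here for illustration.

```python
import numpy as np

def masked_residual(Y, f_Z, mask):
    """Residual Y - f(Z; W) on the observed pixels, and zero on the occluded ones."""
    return np.where(mask, Y - f_Z, 0.0)

def masked_sq_error(Y, f_Z, mask):
    """Squared reconstruction error summed over the observed pixels only."""
    return float(np.sum(masked_residual(Y, f_Z, mask) ** 2))

# Toy 2 x 2 image with one occluded pixel whose value is unknown (NaN)
Y = np.array([[1.0, np.nan], [3.0, 4.0]])
f_Z = np.zeros((2, 2))                          # stand-in for the generator output f(Z; W)
mask = np.array([[True, False], [True, True]])  # True marks observed pixels
err = masked_sq_error(Y, f_Z, mask)             # 1 + 9 + 16 = 26.0
```

Because the masked residual is zero on the occluded pixels, those pixels contribute nothing to the gradients with respect to $Z$ or $W$, while the generator output `f_Z` still provides a value for them.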
Table 1. Recovery errors in the 5 experiments of learning from occluded images.
experiment  P.5  P.7  P.9  M20  M30
error  .0571  .0662  .0771  .0773  .1035
We want to emphasize that in our experiments, all the training images are partially occluded. Our experiments are different from (1) denoising autoencoder (Vincent et al., 2008), where the training images are fully observed, and noises are added as a matter of regularization, (2) inpainting or denoising, where the prior model or regularization has already been learned or given. (2) is about task (T3) mentioned above, but not about tasks (T1) and (T2).
Learning from incomplete data can be difficult for GAN and VAE, because the occluded pixels are different for different training images.
We evaluate our method on 10,000 images randomly selected from the CelebA dataset. We design 5 experiments, with two types of occlusions: (1) 3 experiments are about salt and pepper occlusion, where we randomly place small masks on the image domain to cover roughly 50%, 70% and 90% of pixels respectively. These 3 experiments are denoted P.5, P.7, and P.9 respectively (P for pepper). (2) 2 experiments are about single region mask occlusion, where we randomly place a $20 \times 20$ or $30 \times 30$ mask on the image domain. These 2 experiments are denoted M20 and M30 respectively (M for mask). Table 1 displays the recovery errors of the 5 experiments, where the error is defined as the per pixel difference (relative to the range of the pixel values) between the original image and the recovered image on the occluded pixels. We emphasize that the recovery errors are not training errors, because the intensities of the occluded pixels are not observed in training. Figure 6 displays recovery results. In experiment P.9, 90% of pixels are occluded, but we can still learn the model and recover the original images.
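The recovery error can be computed as in the sketch below. Two points are assumptions made here: the "per pixel difference" is taken to be the mean absolute difference, and the pixel range is passed in explicitly as `value_range`. The function name is also hypothetical.

```python
import numpy as np

def recovery_error(original, recovered, occluded, value_range):
    """Mean absolute difference on the occluded pixels, relative to the pixel range."""
    diff = np.abs(original - recovered)[occluded]
    return float(diff.mean() / value_range)

# Toy example with pixel values in [-1, 1], so the range is 2
original = np.array([[1.0, -1.0], [0.0, 0.5]])
recovered = np.array([[0.0, -1.0], [0.0, 0.0]])
occluded = np.array([[True, False], [False, True]])    # True marks occluded pixels
err = recovery_error(original, recovered, occluded, value_range=2.0)   # (1 + 0.5) / 2 / 2 = 0.375
```

Only the occluded pixels enter the average, so the number measures how well unobserved intensities are filled in rather than how well observed ones are fitted.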
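Table 2. Recovery errors in the experiment of learning from indirect data, for different latent dimensions.
experiment  
error  .0795  .0617  .0625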
Experiment 6. Learning from indirect data. We can learn the model from the compressively sensed data (Candès, Romberg, and Tao, 2006). We generate a set of white noise images as random projections. We then project the training images on these white noise images. We can learn the model from the random projections instead of the original images. We only need to replace $\|Y - f(Z; W)\|^2$ by $\|S - A f(Z; W)\|^2$, where $A$ is the given white noise sensing matrix, and $S$ is the observation. We can treat $A$ as a fully connected layer of known filters below $f(Z; W)$, so that we can continue to use alternating backpropagation to infer $Z$ and learn $W$, thus recovering the image by $f(Z; W)$. In the end, we will be able to (T1) Recover the original images from their projections during learning. (T2) Synthesize new images from the learned model. (T3) Recover testing images from their projections based on the learned model. Our experiments are different from traditional compressed sensing, which is task (T3), but not tasks (T1) and (T2). Moreover, the image recovery in our work is based on nonlinear dimension reduction instead of linear sparsity.
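Treating the sensing matrix as a fixed linear layer below the generator means that the residual is measured in the projected space and mapped back through the transpose of that matrix before the usual back-propagation continues. A minimal sketch, with the symbols `A` (sensing matrix) and `S` (observed projections) and the function name chosen here for illustration:

```python
import numpy as np

def projected_residual_signal(S, A, f_Z):
    """Gradient of -0.5 * ||S - A f||^2 with respect to the generator output f.

    This vector plays the role that the residual Y - f(Z; W) plays for direct data.
    """
    return A.T @ (S - A @ f_Z)

# Toy example: a 3-pixel signal observed through 2 projections
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]])
Y_true = np.array([1.0, 2.0, 3.0])
S = A @ Y_true                              # observed projections, here [4, 4]
f_Z = np.array([1.0, 1.0, 1.0])             # stand-in for the current generator output
g = projected_residual_signal(S, A, f_Z)    # A^T ([4, 4] - [2, 2]) = [2, 4, 2]
```

Dividing `g` by $\sigma^2$ and passing it through the derivative of $f$ with respect to $Z$ or $W$ gives the gradients needed by the two back-propagation steps.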
We evaluate our method on 1000 face images randomly selected from the CelebA dataset. These images are projected onto white noise images whose pixels are sampled independently at random. After this random projection, each image becomes a vector of projection coefficients. We show the recovery errors for different latent dimensions $d$ in Table 2, where the recovery error is defined as the per pixel difference (relative to the range of the pixel values) between the original image and the recovered image. Figure 7 shows some recovery results.
Experiment 7. Model evaluation by reconstruction error on testing data. After learning the model from the training images (now assumed to be fully observed), we can evaluate the model by the reconstruction error on the testing images. We randomly select 1000 face images for training and 300 images for testing from the CelebA dataset. After learning, we infer the latent factors $Z$ for each testing image $Y$ using inferential backpropagation, and then reconstruct the testing image by $f(Z; W)$ using the inferred $Z$ and the learned $W$. In the inferential backpropagation for inferring $Z$, we initialize $Z$ and run 300 Langevin steps with step size .05. Table 3 shows the reconstruction errors of alternating backpropagation learning (ABP) as compared to PCA learning for different latent dimensions $d$. Figure 8 shows some reconstructed testing images. For PCA, we learn the $d$ eigenvectors from the training images, and then project the testing images on the learned eigenvectors for reconstruction.
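The PCA baseline can be sketched as follows. This is one common way to implement it and is an assumption made here: the data are centered by the training mean, the eigenvectors are obtained from an SVD of the centered training matrix, and images are stored as rows. The function names are hypothetical.

```python
import numpy as np

def pca_fit(X_train, d):
    """Return the training mean and the top d principal directions (as rows)."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:d]

def pca_reconstruct(X, mean, components):
    """Project the rows of X onto the learned directions and map back to pixel space."""
    return mean + (X - mean) @ components.T @ components

# Toy check: data lying exactly in a 2-dimensional affine subspace of R^5
rng = np.random.default_rng(0)
B = rng.standard_normal((2, 5))
offset = rng.standard_normal(5)
X_train = rng.standard_normal((20, 2)) @ B + offset
X_test = rng.standard_normal((4, 2)) @ B + offset
mean, components = pca_fit(X_train, d=2)
X_rec = pca_reconstruct(X_test, mean, components)
```

In this toy case the testing data are reconstructed exactly because they lie in the same linear subspace as the training data; for images on a nonlinear manifold the linear projection leaves a residual.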
Table 3. Reconstruction errors on testing images for ABP and PCA, for different latent dimensions.
experiment  
ABP  .0810  .0617  .0549  .0523
PCA  .1038  .0820  .0722  .0621
Experiments 5-7 may be used to evaluate generative models in general. Experiments 5 and 6 appear new, and we have not found comparable methods that can accomplish all three tasks (T1), (T2), and (T3) simultaneously.
5 Conclusion
This paper proposes an alternating backpropagation algorithm for training the generator network. We recognize that the generator network is a nonlinear generalization of the factor analysis model, and develop the alternating backpropagation algorithm as the nonlinear generalization of the alternating regression scheme of the Rubin-Thayer EM algorithm for fitting the factor analysis model. The alternating backpropagation algorithm iterates the inferential backpropagation for inferring the latent factors and the learning backpropagation for updating the parameters. Both backpropagation steps share most of their computing steps in the chain rule calculations.
Our learning algorithm is perhaps the most canonical algorithm for training the generator network. It is based on maximum likelihood, which is theoretically the most accurate estimator. The maximum likelihood learning seeks to explain and charge the whole dataset uniformly, so that there is little concern of underfitting or biased fitting.
As an unsupervised learning algorithm, the alternating backpropagation algorithm is a natural generalization of the original backpropagation algorithm for supervised learning. It adds an inferential backpropagation step to the learning backpropagation step, with minimal overhead in coding and affordable overhead in computing. The inferential backpropagation seeks to perform accurate explaining-away inference of the latent factors. It can be worthwhile for tasks such as learning from incomplete or indirect data, or learning models where the latent factors themselves follow sophisticated prior models with unknown parameters. The inferential backpropagation may also be used to evaluate the generators learned by other methods on tasks such as reconstructing or completing testing data.
Our method or its variants can be applied to nonlinear matrix factorization and completion. It can also be applied to problems where some components or aspects of the factors are supervised.
Code, images, sounds, and videos
Acknowledgement
We thank Yifei (Jerry) Xu for his help with the experiments during his 2016 summer visit. We thank Jianwen Xie for helpful discussions.
The work is supported by NSF DMS 1310391, DARPA SIMPLEX N6600115C4035, ONR MURI N000141612007, and DARPA ARO W911NF1610579.
6 Appendix
6.1 ReLU and piecewise factor analysis
The generator network is $Y = f(Z; W) + \epsilon$, $Z^{(l-1)} = f_l(W_l Z^{(l)} + b_l)$, $l = 1, ..., L$, with $Z^{(0)} = f(Z; W)$, and $Z^{(L)} = Z$. The element-wise nonlinearity $f_l$ in modern ConvNet is usually the two-piece linearity, such as rectified linear unit (ReLU) (Krizhevsky, Sutskever, and Hinton, 2012) or the leaky ReLU (Maas, Hannun, and Ng, 2013; Xu et al., 2015). Each ReLU unit corresponds to a binary switch. For the case of non-leaky ReLU, following the analysis of (Pascanu, Montufar, and Bengio, 2013), we can write $Z^{(l-1)} = \delta_l (W_l Z^{(l)} + b_l)$, where $\delta_l = \mathrm{diag}(1(W_l Z^{(l)} + b_l > 0))$ is a diagonal matrix, and $1()$ is an element-wise indicator function. For the case of leaky ReLU, the 0 values on the diagonal are replaced by a leaking factor (e.g., .2).
$\delta = (\delta_l, l = 1, ..., L)$ forms a classification of $Z$ according to the network $W$. Specifically, the factor space of $Z$ is divided into a large number of pieces by the hyperplanes $W_l Z^{(l)} + b_l = 0$, and each piece is indexed by an instantiation of $\delta$. We can write $\delta = \delta(Z; W)$ to make explicit its dependence on $Z$ and $W$. On the piece indexed by $\delta$, $f(Z; W) = W_\delta Z + b_\delta$. Assuming $b_l = 0$, $\forall l$, for simplicity, we have $W_\delta = \delta_1 W_1 \cdots \delta_L W_L$. Thus each piece defined by $\delta = \delta(Z; W)$ corresponds to a linear factor analysis $Y = W_\delta Z + \epsilon$, whose basis $W_\delta$ is a multiplicative recomposition of the basis functions at multiple layers $(W_l, l = 1, ..., L)$, and the recomposition is controlled by the binary switches at multiple layers $(\delta_l, l = 1, ..., L)$. Hence the top-down ConvNet amounts to a reconfigurable basis $W_\delta$ for representing $Y$, and the model is a piecewise linear factor analysis. If we retain the bias term, we will have $Y = W_\delta Z + b_\delta + \epsilon$, for an overall bias term $b_\delta$ that depends on $\delta$. So the distribution of $Y$ is essentially piecewise Gaussian.
The generator model can be considered an explicit implementation of the local linear embedding (Roweis and Saul, 2000), where $Z$ is the embedding of $Y$. In local linear embedding, the mapping between $Z$ and $Y$ is implicit. In the generator model, the mapping from $Z$ to $Y$ is explicit. With ReLU ConvNet, the mapping is piecewise linear, which is consistent with local linear embedding, except that the partition of the linear pieces by $\delta(Z; W)$ in the generator model is learned automatically.
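The piecewise linear factor analysis view can be verified directly on a small bias-free ReLU network: the output at a given latent vector equals a single matrix, built from the weights and the binary switches, applied to that vector. The two-layer dense network and its sizes below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((6, 4))    # layer 1 (bottom): maps Z^(1) to Z^(0)
W2 = rng.standard_normal((4, 2))    # layer 2 (top): maps Z = Z^(2) to Z^(1)
Z = rng.standard_normal(2)

# Forward pass with ReLU at both layers and zero biases
a2 = W2 @ Z
z1 = np.maximum(a2, 0.0)
a1 = W1 @ z1
f = np.maximum(a1, 0.0)             # f(Z; W)

# Binary switches: diagonal indicator matrices of the active units at each layer
delta2 = np.diag((a2 > 0).astype(float))
delta1 = np.diag((a1 > 0).astype(float))

# Reconfigurable basis on the piece that contains Z
W_delta = delta1 @ W1 @ delta2 @ W2
```

Here `W_delta @ Z` reproduces `f` exactly. A different `Z` may activate a different set of switches and hence a different `W_delta`, which is what makes the basis reconfigurable.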
The inferential backpropagation is a Langevin dynamics on the energy function $\|Y - f(Z; W)\|^2/(2\sigma^2) + \|Z\|^2/2$. With $f(Z; W) = W_\delta Z$, $\partial f(Z; W)/\partial Z = W_\delta$. If $Z$ belongs to the piece defined by $\delta$, then the inferential backpropagation seeks to approximate $Y$ by the basis $W_\delta$ via a ridge regression. Because $Z$ keeps changing during the Langevin dynamics, $\delta(Z; W)$ may also be changing, and the algorithm searches for the optimal reconfigurable basis $W_\delta$ to approximate $Y$. We may solve $Z$ by second-order methods such as iterated ridge regression, which can be computationally more expensive than the simple gradient descent.
6.2 EM, density mapping, and density shifting
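Suppose the training data $\{Y_i, i = 1, \ldots, n\}$ come from a data distribution $P_{\rm data}(Y)$. To understand how the alternating backpropagation algorithm or its EM idealization maps the prior distribution of the latent factors $p(Z)$ to the data distribution $P_{\rm data}(Y)$ by the learned $f(Z; W)$, we define
$$P_{\rm data}(Z, Y; W) = P_{\rm data}(Y)\, p(Z \mid Y, W) = P_{\rm data}(Z; W)\, P_{\rm data}(Y \mid Z, W), \tag{7}$$
where $P_{\rm data}(Z; W) = \int p(Z \mid Y, W)\, P_{\rm data}(Y)\, dY$ is obtained by averaging the posteriors $p(Z \mid Y, W)$ over the observed data $Y \sim P_{\rm data}$. That is, $P_{\rm data}(Z; W)$ can be considered the data prior. The data prior $P_{\rm data}(Z; W)$ is close to the true prior $p(Z)$ in the sense that
$$\begin{aligned} {\rm KL}(P_{\rm data}(Z; W) \,\|\, p(Z)) &\le {\rm KL}(P_{\rm data}(Y) \,\|\, p(Y; W)) \\ &= {\rm KL}(P_{\rm data}(Z, Y; W) \,\|\, p(Z, Y; W)). \end{aligned} \tag{8}$$
The right hand side of (8) is minimized at the maximum likelihood estimate $\hat{W}$, hence the data prior $P_{\rm data}(Z; \hat{W})$ at $\hat{W}$ should be especially close to the true prior $p(Z)$. In other words, at $\hat{W}$, the posteriors $p(Z \mid Y_i, \hat{W})$ of all the data points $\{Y_i\}$ tend to pave the true prior $p(Z)$.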
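The relation in (8) between the data prior and the true prior can be checked numerically on a toy example. The sketch below replaces the continuous latent factors and signals by small discrete spaces, draws an arbitrary prior, an arbitrary conditional model, and an arbitrary data distribution, and then compares three Kullback-Leibler divergences: data prior versus true prior, data distribution versus model marginal, and the two joint distributions. The discrete setting, the sizes `nz` and `ny`, and the helper names `random_dist` and `kl` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
nz, ny = 4, 6  # sizes of the discrete Z and Y spaces (toy choice)


def random_dist(*shape):
    """Random strictly positive distribution(s), normalized over the last axis."""
    p = rng.random(shape) + 0.1
    return p / p.sum(axis=-1, keepdims=True)


def kl(p, q):
    """KL divergence between two strictly positive arrays of the same shape."""
    return float(np.sum(p * np.log(p / q)))


p_z = random_dist(nz)              # true prior p(Z)
p_y_given_z = random_dist(nz, ny)  # model p(Y | Z, W); rows are indexed by Z
p_data_y = random_dist(ny)         # data distribution P_data(Y)

p_zy = p_z[:, None] * p_y_given_z  # model joint p(Z, Y; W)
p_y = p_zy.sum(axis=0)             # model marginal p(Y; W)
posterior = p_zy / p_y             # p(Z | Y, W); columns are indexed by Y

data_joint = posterior * p_data_y     # P_data(Z, Y; W) = P_data(Y) p(Z | Y, W)
data_prior = data_joint.sum(axis=1)   # P_data(Z; W), the averaged posterior

kl_prior = kl(data_prior, p_z)
kl_marginal = kl(p_data_y, p_y)
kl_joint = kl(data_joint, p_zy)

print("KL(data prior | prior)   =", kl_prior)
print("KL(P_data(Y) | p(Y; W))  =", kl_marginal)
print("KL(joint data | joint)   =", kl_joint)
```

The equality of the last two quantities holds because both joints share the same posterior, so their divergence reduces to the divergence between the marginals of the signal; the inequality follows because marginalizing out the signal cannot increase the divergence.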
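From Rubin's multiple imputation point of view (Rubin, 2004) of the EM algorithm, the E-step of EM infers $Z_i^{(m)} \sim p(Z_i \mid Y_i, W_t)$ for $m = 1, \ldots, M$, where $M$ is the number of multiple imputations or multiple guesses of $Z_i$. The multiple guesses account for the uncertainty in inferring $Z_i$ from $Y_i$. The M-step of EM maximizes $Q(W) = \sum_{i=1}^{n} \sum_{m=1}^{M} \log p(Y_i, Z_i^{(m)}; W)$ to obtain $W_{t+1}$. For each data point $Y_i$, $W_{t+1}$ seeks to reconstruct $Y_i$ by $f(Z; W)$ from the inferred latent factors $\{Z_i^{(m)}, m = 1, \ldots, M\}$. In other words, the M-step seeks to map $\{Z_i^{(m)}\}$ to $Y_i$. Pooling over all $i = 1, \ldots, n$, $\{Z_i^{(m)}, \forall i, m\} \sim P_{\rm data}(Z; W_t)$, hence the M-step seeks to map $P_{\rm data}(Z; W_t)$ to the data distribution $P_{\rm data}(Y)$. Of course the mapping from $\{Z_i^{(m)}\}$ to $Y_i$ cannot be exact. In fact, $f(Z; W)$ maps $\{Z_i^{(m)}\}$ to a $d$-dimensional patch around the $D$-dimensional $Y_i$. The local patches for all $\{Y_i, \forall i\}$ patch up the $d$-dimensional manifold formed by the $D$-dimensional observed examples and their interpolations. The EM algorithm is a process of density shifting, so that $P_{\rm data}(Z; W)$ shifts towards $p(Z)$, thus $f(Z; W)$ maps $p(Z)$ to $P_{\rm data}(Y)$.
6.3 Factor analysis and alternating regression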
The alternating backpropagation algorithm is inspired by the Rubin-Thayer EM algorithm for factor analysis, where both the observed data model and the posterior distribution are available in closed form. The EM algorithm for factor analysis can be interpreted as alternating linear regression (Rubin and Thayer, 1982; Liu, Rubin, and Wu, 1998).
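In the factor analysis model, $Z \sim {\rm N}(0, I_d)$, $Y = WZ + \epsilon$, and $\epsilon \sim {\rm N}(0, \sigma^2 I_D)$. The joint distribution of $(Z, Y)$ is
$$\begin{pmatrix} Z \\ Y \end{pmatrix} \sim {\rm N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} I_d & W^\top \\ W & WW^\top + \sigma^2 I_D \end{pmatrix} \right). \tag{9}$$
Denote
$$S = \begin{pmatrix} S_{ZZ} & S_{ZY} \\ S_{YZ} & S_{YY} \end{pmatrix} = \begin{pmatrix} I_d & W^\top \\ W & WW^\top + \sigma^2 I_D \end{pmatrix}. \tag{10}$$
The posterior distribution $p(Z \mid Y, W)$ can be obtained by linear regression of $Z$ on $Y$, $[Z \mid Y, W] \sim {\rm N}(\beta Y, V)$, where
$$\beta = S_{ZY} S_{YY}^{-1}, \tag{11}$$
$$V = S_{ZZ} - S_{ZY} S_{YY}^{-1} S_{YZ}. \tag{12}$$
The above computation can be carried out by the sweep operator on $S$, with $S_{YY}$ being the pivotal matrix.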
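The role of the sweep operator can be made concrete with a short NumPy sketch. It uses one common sign convention for sweeping a symmetric matrix (the pivot entry becomes the negative reciprocal; other conventions differ only in signs) and checks that sweeping the joint covariance of the latent factors and the signal on the signal block leaves the regression coefficients in the off-diagonal block and the posterior covariance in the latent block. The function name `sweep`, the dimensions, and the random loading matrix are assumptions for illustration.

```python
import numpy as np


def sweep(A, pivots):
    """Sweep a symmetric matrix A on the given pivot indices (returns a copy)."""
    A = np.array(A, dtype=float)
    for k in pivots:
        pivot = A[k, k]
        row = A[k, :] / pivot
        col = A[:, k].copy()
        A -= np.outer(col, row)   # A_ij - A_ik A_kj / A_kk for all i, j
        A[k, :] = row             # A_kj / A_kk
        A[:, k] = col / pivot     # A_ik / A_kk
        A[k, k] = -1.0 / pivot    # -1 / A_kk
    return A


rng = np.random.default_rng(0)
d, D, sigma2 = 2, 5, 0.5          # toy dimensions and noise variance
W = rng.standard_normal((D, d))   # hypothetical loading matrix

# Joint covariance of (Z, Y) in the factor analysis model.
S = np.block([[np.eye(d), W.T],
              [W, W @ W.T + sigma2 * np.eye(D)]])

# Sweep on the Y block (indices d, ..., d + D - 1).
swept = sweep(S, range(d, d + D))
beta = swept[:d, d:]   # regression coefficients of Z on Y
V = swept[:d, :d]      # posterior covariance of Z given Y

# Direct computation for comparison.
S_yy_inv = np.linalg.inv(S[d:, d:])
beta_direct = S[:d, d:] @ S_yy_inv
V_direct = S[:d, :d] - S[:d, d:] @ S_yy_inv @ S[d:, :d]
print(np.allclose(beta, beta_direct), np.allclose(V, V_direct))
```

After the sweep, the signal block holds the negative inverse of its original value, so all the quantities of a linear regression are read off from a single matrix.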
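Suppose we have observations $\{Y_i, i = 1, \ldots, n\}$. In the E-step, we compute
$${\rm E}[Z_i \mid Y_i, W] = \beta Y_i, \tag{13}$$
$${\rm E}[Z_i Z_i^\top \mid Y_i, W] = V + \beta Y_i Y_i^\top \beta^\top. \tag{14}$$
In the M-step, we compute
$$\mathbf{S} = \begin{pmatrix} \mathbf{S}_{ZZ} & \mathbf{S}_{ZY} \\ \mathbf{S}_{YZ} & \mathbf{S}_{YY} \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} {\rm E}[Z_i Z_i^\top] & \sum_{i=1}^{n} {\rm E}[Z_i] Y_i^\top \\ \sum_{i=1}^{n} Y_i {\rm E}[Z_i]^\top & \sum_{i=1}^{n} Y_i Y_i^\top \end{pmatrix}, \tag{15}$$
where we use ${\rm E}[Z_i]$ and ${\rm E}[Z_i Z_i^\top]$ to denote the conditional expectations in (13) and (14). Then we regress $Y$ on $Z$ to obtain the coefficient matrix and residual variance-covariance matrix
$$W = \mathbf{S}_{YZ} \mathbf{S}_{ZZ}^{-1}, \tag{16}$$
$$\Sigma = \frac{1}{n} \left( \mathbf{S}_{YY} - \mathbf{S}_{YZ} \mathbf{S}_{ZZ}^{-1} \mathbf{S}_{ZY} \right). \tag{17}$$
If $\sigma^2$ is unknown, it can be obtained by averaging the diagonal elements of $\Sigma$. The computation can again be done by the sweep operator on $\mathbf{S}$, with $\mathbf{S}_{ZZ}$ being the pivotal matrix.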
The E-step is based on the multivariate linear regression of $Z$ on $Y$ given $W$. The M-step updates $W$ by the multivariate linear regression of $Y$ on $Z$. Both steps can be accomplished by the sweep operator. We use the notation $S$ and $\mathbf{S}$ for the Gram matrices to highlight the analogy between the two steps. The EM algorithm can then be considered alternating linear regression or alternating sweep operation, which serves as a prototype for alternating backpropagation.
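The alternating regression view of EM for factor analysis can be sketched in a few lines of NumPy. The E-step below regresses the latent factors on the signals to get posterior means and the pooled second moments, and the M-step regresses the signals on the latent factors to update the loading matrix and the noise variance. Linear solves are used in place of an explicit sweep operator, and the function name `em_factor_analysis`, the synthetic data, the initialization, and the iteration count are assumptions for illustration rather than the setup of any experiment in the paper.

```python
import numpy as np


def em_factor_analysis(Y, d, n_iter=50, seed=0):
    """EM for Y = W Z + eps with Z ~ N(0, I_d), eps ~ N(0, sigma2 I_D).

    Y is an (n, D) matrix of zero-mean observations. Returns the loading
    matrix W (D, d), the noise variance sigma2, and the log-likelihood
    evaluated at the start of every iteration.
    """
    rng = np.random.default_rng(seed)
    n, D = Y.shape
    W = rng.standard_normal((D, d))
    sigma2 = 1.0
    S_yy = Y.T @ Y                                  # sum_i Y_i Y_i^T
    loglik = []
    for _ in range(n_iter):
        C = W @ W.T + sigma2 * np.eye(D)            # marginal covariance of Y
        _, logdet = np.linalg.slogdet(C)
        loglik.append(-0.5 * (n * D * np.log(2 * np.pi) + n * logdet
                              + np.trace(np.linalg.solve(C, S_yy))))

        # E-step: linear regression of Z on Y.
        beta = np.linalg.solve(C, W).T              # W^T C^{-1}, shape (d, D)
        V = np.eye(d) - beta @ W                    # posterior covariance
        EZ = Y @ beta.T                             # rows are E[Z_i | Y_i, W]
        S_zz = n * V + EZ.T @ EZ                    # sum_i E[Z_i Z_i^T]
        S_yz = Y.T @ EZ                             # sum_i Y_i E[Z_i]^T

        # M-step: linear regression of Y on Z.
        W = np.linalg.solve(S_zz, S_yz.T).T         # S_yz S_zz^{-1}
        Sigma = (S_yy - W @ S_yz.T) / n             # residual covariance
        sigma2 = np.trace(Sigma) / D                # average of the diagonal
    return W, sigma2, loglik


# Synthetic data from a hypothetical ground-truth model.
rng = np.random.default_rng(1)
n, D, d = 500, 6, 2
W_true = rng.standard_normal((D, d))
Y = rng.standard_normal((n, d)) @ W_true.T + np.sqrt(0.3) * rng.standard_normal((n, D))

W_hat, sigma2_hat, loglik = em_factor_analysis(Y, d)
print("first and last log-likelihood:", loglik[0], loglik[-1])
```

Since both steps are exact for this linear Gaussian model, the recorded log-likelihood should not decrease from one iteration to the next, which gives a simple sanity check on an implementation.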
References
 Candès, Romberg, and Tao (2006) Candès, E. J.; Romberg, J.; and Tao, T. 2006. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory 52(2):489–509.
 Dempster, Laird, and Rubin (1977) Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: B 1–38.
 Denton et al. (2015) Denton, E. L.; Chintala, S.; Fergus, R.; et al. 2015. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 1486–1494.
 Dinh, Sohl-Dickstein, and Bengio (2016) Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2016. Density estimation using real nvp. CoRR abs/1605.08803.
 Doretto et al. (2003) Doretto, G.; Chiuso, A.; Wu, Y.; and Soatto, S. 2003. Dynamic textures. IJCV 51(2):91–109.
 Dosovitskiy, Springenberg, and Brox (2015) Dosovitskiy, A.; Springenberg, J. T.; and Brox, T. 2015. Learning to generate chairs with convolutional neural networks. In CVPR.
 Girolami and Calderhead (2011) Girolami, M., and Calderhead, B. 2011. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: B 73(2):123–214.
 Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.
 Hyvärinen, Karhunen, and Oja (2004) Hyvärinen, A.; Karhunen, J.; and Oja, E. 2004. Independent component analysis. John Wiley & Sons.
 Ioffe and Szegedy (2015) Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
 Kim and Park (2008) Kim, H., and Park, H. 2008. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications 30(2):713–730.
 Kingma and Welling (2014) Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. In ICLR.
 Koren, Bell, and Volinsky (2009) Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8):30–37.
 Krizhevsky, Sutskever, and Hinton (2012) Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1097–1105.
 LeCun et al. (1998) LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 Lee and Seung (2001) Lee, D. D., and Seung, H. S. 2001. Algorithms for nonnegative matrix factorization. In NIPS, 556–562.
 Liu et al. (2015) Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In ICCV, 3730–3738.
 Liu, Rubin, and Wu (1998) Liu, C.; Rubin, D. B.; and Wu, Y. N. 1998. Parameter expansion to accelerate em: The pxem algorithm. Biometrika 85(4):755–770.
 Lu, Zhu, and Wu (2016) Lu, Y.; Zhu, S.-C.; and Wu, Y. N. 2016. Learning FRAME models using CNN filters. In AAAI.
 Maas, Hannun, and Ng (2013) Maas, A. L.; Hannun, A. Y.; and Ng, A. Y. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML.
 McDermott and Simoncelli (2011) McDermott, J. H., and Simoncelli, E. P. 2011. Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron 71(5):926–940.
 Mnih and Gregor (2014) Mnih, A., and Gregor, K. 2014. Neural variational inference and learning in belief networks. In ICML.

 Neal (2011) Neal, R. M. 2011. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2.
 Olshausen and Field (1997) Olshausen, B. A., and Field, D. J. 1997. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision Research 37(23):3311–3325.
 Pascanu, Montufar, and Bengio (2013) Pascanu, R.; Montufar, G.; and Bengio, Y. 2013. On the number of response regions of deep feed forward networks with piecewise linear activations. arXiv:1312.6098.
 Radford, Metz, and Chintala (2016) Radford, A.; Metz, L.; and Chintala, S. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR.

 Rezende, Mohamed, and Wierstra (2014) Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. In NIPS, 1278–1286.
 Roweis and Saul (2000) Roweis, S. T., and Saul, L. K. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326.
 Rubin and Thayer (1982) Rubin, D. B., and Thayer, D. T. 1982. Em algorithms for ml factor analysis. Psychometrika 47(1):69–76.
 Rubin (2004) Rubin, D. B. 2004. Multiple imputation for nonresponse in surveys, volume 81. John Wiley & Sons.
 Vedaldi and Lenc (2015) Vedaldi, A., and Lenc, K. 2015. Matconvnet – convolutional neural networks for matlab. In Int. Conf. on Multimedia.

 Vincent et al. (2008) Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P.-A. 2008. Extracting and composing robust features with denoising autoencoders. In ICML, 1096–1103.
 Wang and Zhu (2003) Wang, Y., and Zhu, S.-C. 2003. Modeling textured motion: Particle, wave and sketch. In ICCV, 213–220.
 Xie et al. (2016) Xie, J.; Lu, Y.; Zhu, S.-C.; and Wu, Y. N. 2016. A theory of generative convnet. In ICML.
 Xu et al. (2015) Xu, B.; Wang, N.; Chen, T.; and Li, M. 2015. Empirical evaluation of rectified activations in convolutional network. CoRR abs/1505.00853.
 Younes (1999) Younes, L. 1999. On the convergence of markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics: An International Journal of Probability and Stochastic Processes 65(3-4):177–228.