1 Introduction
Machine learning is about capturing aspects of the unknown distribution from which the observed data are sampled (the datagenerating distribution). For many learning algorithms and in particular in manifold learning, the focus is on identifying the regions (sets of points) in the space of examples where this distribution concentrates, i.e., which configurations of the observed variables are plausible.
Unsupervised representationlearning algorithms attempt to characterize the datagenerating distribution through the discovery of a set of features or latent variables whose variations capture most of the structure of the datagenerating distribution. In recent years, a number of unsupervised feature learning algorithms have been proposed that are based on minimizing some form of reconstruction error, such as autoencoder and sparse coding variants (Bengio et al., 2007; Ranzato et al., 2007; Jain and Seung, 2008; Ranzato et al., 2008; Vincent et al., 2008; Kavukcuoglu et al., 2009; Rifai et al., 2011a, b, c; Gregor et al., 2011). An autoencoder reconstructs the input through two stages, an encoder function (which outputs a learned representation of an example ) and a decoder function , such that for most sampled from the datagenerating distribution. These feature learning algorithms can be stacked to form deeper and more abstract representations. There are arguments and much empirical evidence to suggest that when they are welltrained, such deep learning algorithms (Hinton et al., 2006; Bengio, 2009; Lee et al., 2009; Salakhutdinov and Hinton, 2009) can perform better than their shallow counterparts, both in terms of learning features for the purpose of classification tasks and for generating higherquality samples.
Here we restrict ourselves to the case of continuous inputs with the datagenerating distribution being associated with an unknown target density function, denoted . Manifold learning algorithms assume that is concentrated in regions of lower dimension (Cayton, 2005; Narayanan and Mitter, 2010), i.e., the training examples are by definition located very close to these highdensity manifolds. In that context, the core objective of manifold learning algorithms is to identify the local directions of variation, such that small movement in input space along these directions stays on or near the highdensity manifold.
Some important questions remain concerning many of these feature learning algorithms. What is their training criterion learning about the input density? Do these algorithms implicitly learn about the whole density or only some aspect? If they capture the essence of the target density, then can we formalize that link and in particular exploit it to sample from the model? This would turn these algorithms into implicit density models, which only define a density indirectly, e.g., through a generative procedure that converges to it. These are the questions to which this paper contributes.
A crucial starting point for this work is very recent work (Rifai et al., 2012) proposing a sampling algorithm for Contractive AutoEncoders
, detailed in the next section. This algorithm was motivated on geometrical grounds, based on the observation and intuition that the leading singular vectors of the Jacobian of the encoder function specify those main directions of variation (i.e., the tangent plane of the manifold, the local directions that preserve the highprobability nature of training examples). Here we make a formal link between the target density and models minimizing reconstruction error through a contractive mapping, such as the Contractive AutoEncoder
(Rifai et al., 2011a) and the Denoising AutoEncoder (Vincent et al., 2008). This allows us to justify sampling algorithms similar to that proposed by (Rifai et al., 2012), and apply these ideas to Denoising AutoEncoders as well.We define a novel alternative to maximum likelihood training, local moment matching
, which we find that Contractive and Denoising AutoEncoders perform. This is achieved by optimizing a criterion (such as a regularized reconstruction error) such that the optimal learned reconstruction function (and its derivatives) provide estimators of the local moments (and local derivatives) of the target density. These local moments can be used to define an implicit density, the asymptotic distribution of a particular Markov chain, which can also be seen as corresponding to an
uncountable Gaussian mixture, with one Gaussian component at each possible location in input space.The main novel contributions of this paper are the following. First, we show in Section 2 that the Denoising AutoEncoder with small Gaussian perturbations and squared error loss is actually a Contractive AutoEncoder whose contraction penalty is the magnitude of the perturbation, making the theory developed here applicable to both, and in particular extending the sampling procedure used for the Contractive AutoEncoder to the Denoising AutoEncoder as well. Second, we present in Section 3.3 consistency arguments justifying the use of the first and second estimated local moments to sample from a chain. Such a sampling algorithm has successfully been used in Rifai et al. (2012) to sample from a Contractive AutoEncoder. With small enough steps, we show that the asymptotic distribution of the chain has the same similar (smoothed) first and second local moments as those estimated. Third, we show in Section 3.4 that nonparametrically minimizing reconstruction error with a contractive regularizer yields a reconstruction function whose value and Jacobian matrix estimate respectively the first and second local moments, i.e., up to a scaling factor, are the right functions to use in the Markov chain. Finally, although the sampling algorithm was already empirically verified in Rifai et al. (2012), we include in Section 4 an experimental validation for the case when the model is trained with the denoising criterion.
2 Contractive and Denoising AutoEncoders
The Contractive AutoEncoder or CAE (Rifai et al., 2011a) is trained to minimize the following regularized reconstruction error:
(1) 
where and is the sum of the squares of the elements of . Both the squared loss and the crossentropy loss have been used, but here we focus our analysis on the squared loss because of the easier mathematical treatment it allows. Note that success of minimizing the above criterion strongly depends on the parametrization of and and in particular on the tied weights constraint used, with and . The above regularizing term forces (as well as
, because of the tied weights) to be contractive, i.e., to have singular values less than 1
^{1}^{1}1Note that an autoencoder without any regularization would tend to find many leading singular values near 1 in order to minimize reconstruction error, i.e., preserve input norm in all the directions of variation present in the data.. Larger values of yielding more contraction (smaller singular values) where it hurts reconstruction error the least, i.e., in the local directions where there are only little or no variations in the data.The Denoising AutoEncoder or DAE (Vincent et al., 2008) is trained to minimize the following denoising criterion:
(2) 
where is a stochastic corruption of and the expectation is over the training distribution. Here we consider mostly the squared loss and Gaussian noise corruption, again because it is easier to handle them mathematically. In particular, if with
a small zeromean isotropic Gaussian noise vector of variance
, then a Taylor expansion around gives , which when plugged into gives(3)  
where in the second line we used the independance of the noise from and properties of the trace, while in the last line we used and by definition of . This derivation shows that the DAE is also a Contractive AutoEncoder but where the contraction is imposed explicitly on the whole reconstruction function rather than on alone (and as a side effect of the parametrization).
2.1 A CAE Sampling Algorithm
Consider the following coupled Markov chains with elements and respectively:
(4) 
where and is a sample from a zeromean Gaussian with covariance .
The basic algorithm for sampling from the CAE, proposed in Rifai et al. (2012), is based on the above Markov chain operating in the space of hidden representation , with and , where is zeromean isotropic Gaussian noise vector in
space. This defines a chain of hidden representations
, and the corresponding chain of inputspace samples is given by . Slightly better results are obtained with this space than with the corresponding space chain which defines and where is zeromean isotropic Gaussian noise in space. We conjecture that this advantage stems from the fact that moves in space are done in a more abstract, more nonlinear space.3 Local Moment Matching as an Alternative to Maximum Likelihood
3.1 Previous Related Work
Wellknown manifold learning (“embedding”) algorithms include Kernel PCA (Schölkopf et al., 1998), LLE (Roweis and Saul, 2000), Isomap (Tenenbaum et al., 2000), Laplacian Eigenmap (Belkin and Niyogi, 2003), Hessian Eigenmaps (Donoho and Grimes, 2003), Semidefinite Embedding (Weinberger and Saul, 2004), SNE (Hinton and Roweis, 2003) and tSNE (van der Maaten and Hinton, 2008)
that were primarily developed and used for data visualization through dimensionality reduction. These algorithms optimize the hidden representation associated with training points in order to best preserve certain properties of an inputspace neighborhood graph.
The properties that we are interested in here are the local mean and local covariance. They are defined as the mean and covariance of a density restricted to a small neighborhood. For example, if we have lots of samples and we only consider the samples around a point , the local mean at would be estimated by the mean of these neighbors and the local covariance at by the empirical covariance among these neighbors. There are previous machine learning algorithms that have been proposed to estimate these local first and second moments by actually using local neighbors (Brand, 2003; Vincent and Bengio, 2003; Bengio et al., 2006). In Manifold Parzen Windows (Vincent and Bengio, 2003)
this is literally achieved by estimating for each test point the empirical mean and empirical covariance of the neighboring points, with a regularization of the empirical covariance that sets a floor value for the eigenvalues of the covariance matrix. In NonLocal Manifold Parzen
(Bengio et al., 2006), the mean and covariance are predicted by a neural network that takes
as input and outputs the estimated mean along with a basis for the leading eigenvectors and eigenvalues of the estimated covariance. The predictor is trained to maximize the likelihood of the near neighbors under the Gaussian with the predicted mean and covariance parameters. Both algorithms are manifold learning algorithms motivated by the objective to discover the local manifold structure of the data, and in particular predict the manifold tangent planes around any given test point. Besides the computational difficulty of having to find the
nearest neighbors of each training example, these algorithms, especially Manifold Parzen Windows, heavily rely on the smoothness of the target manifolds, so that there are enough samples to teach the model how and where the manifolds bend.Note that the term local moment matching was already used by Gerber (1982) in an actuarial context to match moments of a discretized scalar distribution. Here we consider the more general problem of modeling a multivariate density from data, by estimating the first and second multivariate moments at every possible input point.
3.2 A Sampling Procedure From Local Moment Estimators
We first show that mild conditions suffice for the chain to
converge ^{2}^{2}2This is also shown in Rifai et al. (2012).
and then that if local first and second moments have been estimated,
then one can define a plausible sampling algorithm based on a Markov chain
that exploits these local moments at each step.
Convergence of the Chain
This Markov chain is precisely the one that samples a new point by sampling from the Gaussian with these local first and second moments:
(5) 
as in Eq.(4), but where we propose to choose proportional to the local mean at minus , and proportional to the local covariance at . The functions and thus define a Markov chain.
Let us sketch a proof that it converges under mild hypotheses, using the
decomposition into the chains of ’s and of ’s. Assuming that
for some bounded ball , then . If we further assume that is always full rank,
then there is a nonzero probability of jumping from any
to any , which is sufficient for ergodicity of the
chain and its convergence. Then if ’s converge, so do their noisy
counterparts ’s.
Uncountable Gaussian Mixture
If the chain converges, let be the asymptotic distribution of the ’s. It is interesting to note that satisfies the operator equation
(6) 
where is the density of under a normal multivariate distribution with mean and covariance . This can be seen as a kind of uncountable Gaussian mixture where the weight of each component is given by itself and the functions and specify the mean and covariance of each component.
3.3 Consistency
From the point of view of learning, what we would like is that the and functions used in a sampling chain such as Eq. (5) be such that they yield an asymptotic density close to some target density . Because the Markov chain makes noisy finite steps, one would expect that the best one can hope for is that be a smooth approximation of .
What we show below is that, in the asymptotic regime of very small steps (i.e. is small in magnitude), a good choice is just the intuitive one, where and .
For this purpose, we formally define the local density of points in the neighborhood of an , as
(7) 
where is the appropriate normalizing constant. Then we respectively define the local mean and covariance around as simply being
(8) 
Note that we have two scales here, the scale at which we take the local mean and covariance, and the scale of the Markov chain steps. To prove consistency we assume here that and that both are small. Furthermore, we assume that and are somehow calibrated, so that the steps are comparable in size to . This means that , which we use below.
We want to compute the local mean obtained when we follow the Markov chain, i.e., in Eq. (7) we choose a equal to (which is defined in Eq. (6)), and we obtain the following:
(9)  
Because the Gaussian sample of the inner integral must be in a small region inside the ball, the inner integral is approximately the Gaussian mean if is in the ball, and 0 otherwise:
(10) 
which gives
(11) 
Now we use the above assumptions, which give , to conclude that integrating under the region defined by is equivalent to integrating under the region defined by . Hence the above approximation is rewritten
(12) 
which by the definition of gives the final result:
(13) 
It means that the local mean under the smallsteps Markov chain is a local mean of the chain’s ’s. This justifies choosing equal to the local mean of the target density to be represented by the chain, so that the Markov chain will yield an asymptotic distribution that has local moments that are close but smooth versions of those of the target density.
A similar result can be shown for the covariance by observing that the term in Eq. (10) produced the first moment of the Gaussian and that the same reasoning would apply with instead.
Alternatively, we can follow a shortened version of the above starting with
(14)  
(15) 
where
(16)  
(17) 
By construction the magnitude of the covariance was made very small, so the following term vanishes
(18) 
and we are left with the desired result
(19) 
3.4 Local Moment Matching By Minimizing Regularized Reconstruction Error
We consider here an alternative to using nearest neighbors for estimating local moments, by showing that minimizing reconstruction error with a contractive penalty yields estimators of the local mean and covariance.
We start from a training criterion similar to the CAE’s but penalizing the contraction of the whole autoencoder’s reconstruction function, which is also equivalent to the DAE’s training criterion in the case of small Gaussian corruption noise (as shown in Eq. (3)):
(20) 
where is the target or training distribution. We prove that in a nonparametric setting (where is completely free), the optimal is such that the local mean is estimated by while the local covariance is estimated^{3}^{3}3In practice, i.e., the parametric case, there is no guarantee that the estimator be symmetric, but this is easily fixed by symmetrizing it, i.e., using . by .
To find out what the autoencoder estimates we follow an approach which has already been used, e.g., to show that minimizing squared prediction error is equivalent to estimating the conditional expectation, . For this purpose we consider an asymptotic and nonparametric setting corresponding to the limit where the number of examples goes to infinity (we actually minimize the expected error) and the capacity goes to infinity: we allow the value and the derivative of at every point , to be different, i.e., we “parametrize” in every neighborhood around by
(21) 
which is like a Taylor expansion only valid in the neighborhood of ’s around , but where we actually consider and to be parameters free to be chosen separately for each .
Armed with this nonparametric formulation, we consider an infinity of such neighborhoods and the local density (Eq. 7) which approaches a Dirac delta function in the limit of , and we rewrite as follows:
(22) 
The reason for choosing that turns into a Dirac is that the expectations of and arising in the above inner integral will give rise to the local mean and local covariance . If in the above equation we define nonparametrically (as per Eq. (21)), the minimum can be achieved by considering the separate minimization in each neighborhood with respect to and . We can express the local contribution to the loss at as
(23) 
so that
(24) 
We take the gradient of the local loss with respect to and and set it to 0 (detailed derivation in Appendix) to get
(25)  
Solving these equations (detailed derivation in Appendix) gives us the solutions
(26) 
Note that these quantities are defined through so they depend implicitly on and we should consider what happens when we take the limit .
In particular, when we have that and we can see from the solutions (26) that this forces . In a practical numerical application, we fix to be small and it becomes interesting to see how these quantities relate asymptotically (in terms of decreasing). In such a situation, we have the following asymptotically:^{4}^{4}4Here, to avoid confusion with the overloaded notation for sampling, we instead use the notation to denote that the ratio of any coefficient on the left with its corresponding coefficient on the right goes to 1 as .
(27) 
Thus, we have proved that the value of the reconstruction and its Jacobian immediately give us estimators of the local mean and local covariance, respectively.
4 Experimental Validation
In the above analysis we have considered the limit of a nonparametric case with an infinite amount of data. In that limit, unsurprisingly, reconstruction is perfect, i.e., , , and and , at a speed that depends on the scale , as we have seen above. We do not care so much about the magnitudes but instead care about the directions indicated by (which indicate where to look for increases in probability density) and by the singular vectors and relative singular values of , which indicate what directions preserve high density. In a practical sampling algorithm such as described in Section 2.1, one wants to take noninfinitesimal steps. Furthermore, in the practical experiments of Rifai et al. (2012)
, in order to get good generalization with a limited training set, one typically works with a parametrized model which cannot perfectly reconstruct the training set (but can generalize). This means that the learned reconstruction is not equal to the input, even on training examples, and nor is the Jacobian of the reconstruction function tiny (as it would be in the asymptotic nonparametric case). The mathematical link between the two situations needs to be clarified in future work, but a heuristic approach which we found to work well is the following: control the scale of the Markov chain with a hyperparameter that sets the magnitude of the Gaussian noise (the variance of
in Section 2.1). That hyperparameter can be optimized by visual inspection or by estimating the loglikelihood of the samples, using a technique introduced in (Breuleux et al., 2011) and also used in Rifai et al. (2012). The basic idea is to generate many samples from the model, train a nonparametric density estimator (Parzen Windows) using these samples as a training set, and evaluate the loglikelihood of the test set using that density estimaor. If the sample generator does not mix well, then some test examples will be badly covered (far from any of the generated samples), thus incurring a high price in loglikelihood. If the generator mixes well but smoothes too much the true density, then the automatically selected bandwidth of the Parzen Windows will be chosen larger, incurring again a penalty in test loglikelihood.In Fig. 1, we show samples of DAEs trained and sampled similarly as in Rifai et al. (2012) on both MNIST digits images and the Toronto Face Dataset (TFD) (Susskind et al., 2010). These results and those already obtained in Rifai et al. (2012) confirm that the autoencoder trained either as a CAE or a DAE estimates local moments that can be followed in a Markov chain to generate likelylooking samples (and which have been shown quantitatively to be of high quality in Rifai et al. (2012)).
5 Conclusion
This paper has contributed a novel approach to modeling densities: indirectly, through the estimation of local moments. It has shown that local moments can be estimated by autoencoders with contractive regularization. It has justified a sampling algorithm based on a simple Markov chain when estimators of the local moments are available. Whereas autoencoders are unsupervised learning algorithms that have been known for many decades, it has never been clear if they captured everything that can be captured from a distribution. For the first time, this paper presents a theoretical justification showing that they do implicitly perform density estimation (provided some appropriate regularization is used, and assuming the training criterion can be minimized). This provides a more solid footing to the recently proposed algorithm for sampling Contractive AutoEncoders
(Rifai et al., 2012) and opens the door to other related learning and sampling algorithms. In particular, it shows that this sampling algorithm can be applied to Denoising AutoEncoders as well.An interesting advantage of modeling data through such training criteria is that there is no need to estimate an untractable partition function or its gradient, and that there is no difficult inference problem associated with these types of models either. Future work following up on this paper should try to answer the more difficult mathematical questions of what happens (e.g., with the consistency arguments presented here) if the Markov chain steps are not tiny, and when we consider a learner that has parametric constraints, rather than the asymptotic nonparametric limit considered here. We believe that the approach presented here can also be applied to energybased models (for which the free energy can be computed), and that the local moments are directly related to the first and second derivative of the estimated density. Future work should clarify that relationship, possibly giving rise to new sampling algorithms for energybased models (since we have shown here that one can sample from the estimated density if one can compute the local moments). Finally, it would be interesting to extend this work in the direction of analysis that explicitly takes into account the decomposition of the autoencoder into an encoder and a decoder. Indeed, we have found experimentally that sampling in the representation space gives better results than sampling in the original input space, but a more solid mathematical treatment of such algorithms is still missing.
References
 Belkin and Niyogi (2003) Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.
 Bengio (2009) Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book. Now Publishers, 2009.
 Bengio et al. (2006) Bengio, Y., Larochelle, H., and Vincent, P. (2006). Nonlocal manifold Parzen windows. In NIPS’2005, pages 115–122. MIT Press.
 Bengio et al. (2007) Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layerwise training of deep networks. In NIPS’2006.
 Brand (2003) Brand, M. (2003). Charting a manifold. In NIPS’2002, pages 961–968. MIT Press.
 Breuleux et al. (2011) Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an rbmderived process. Neural Computation, 23(8), 2053–2073.
 Cayton (2005) Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS20080923, UCSD.

Donoho and Grimes (2003)
Donoho, D. L. and Grimes, C. (2003).
Hessian eigenmaps: new locally linear embedding techniques for highdimensional data.
Technical Report 200308, Dept. Statistics, Stanford University.  Gerber (1982) Gerber, H. (1982). On the numerical evaluation of the distribution of aggregate claims and its stoploss premiums. Insurance : Mathematics and Economics, 1, 13–18.
 Gregor et al. (2011) Gregor, K., Szlam, A., and LeCun, Y. (2011). Structured sparse coding via lateral inhibition. In NIPS’2011.
 Hinton and Roweis (2003) Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002.
 Hinton et al. (2006) Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
 Jain and Seung (2008) Jain, V. and Seung, S. H. (2008). Natural image denoising with convolutional networks. In NIPS’2008.
 Kavukcuoglu et al. (2009) Kavukcuoglu, K., Ranzato, M.A., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In CVPR’2009.

Lee et al. (2009)
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009).
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations.
In L. Bottou and M. Littman, editors, Proceedings of the Twentysixth International Conference on Machine Learning (ICML’09). ACM, Montreal (Qc), Canada. 
Narayanan and Mitter (2010)
Narayanan, H. and Mitter, S. (2010).
Sample complexity of testing the manifold hypothesis.
In NIPS’2010.  Ranzato et al. (2007) Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energybased model. In NIPS’06.
 Ranzato et al. (2008) Ranzato, M., Boureau, Y.L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS’07), pages 1185–1192, Cambridge, MA. MIT Press.

Rifai et al. (2011a)
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a).
Contracting autoencoders: Explicit invariance during feature extraction.
In ICML’2011.  Rifai et al. (2011b) Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011b). Higher order contractive autoencoder. In ECML PKDD.

Rifai et al. (2011c)
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011c).
The manifold tangent classifier.
In NIPS’2011.  Rifai et al. (2012) Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive autoencoders. In ICML’2012.
 Roweis and Saul (2000) Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.

Salakhutdinov and Hinton (2009)
Salakhutdinov, R. and Hinton, G. (2009).
Deep Boltzmann machines.
InProceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009)
, volume 8.  Schölkopf et al. (1998) Schölkopf, B., Smola, A., and Müller, K.R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
 Susskind et al. (2010) Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML TR 2010001, U. Toronto.
 Tenenbaum et al. (2000) Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
 van der Maaten and Hinton (2008) van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using tsne. J. Machine Learning Res., 9.
 Vincent and Bengio (2003) Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002. MIT Press.

Vincent et al. (2008)
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.A. (2008).
Extracting and composing robust features with denoising autoencoders.
In ICML 2008.  Weinberger and Saul (2004) Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by semidefinite programming. In CVPR’2004, pages 988–995.
Appendix A Derivation of the local training criterion gradient
We want to obtain the derivative of the local training criterion,
(28) 
We use the definitions
We first expand the expected square error:
Differentiating this with respect to yields
corresponding to Eq.(25) of the paper.
For differentiating with respect to , we use the trace properties
We obtain for the regularizer,
and for the MSE:
Appendix B Detailed derivation of the minimizers of the local training criterion
Starting with , we solve when the gradient is 0 to obtain
Substituting that value for into the expression for the gradient with respect to , we get
Adding the regularizer term and setting the gradient to 0, we get
which altogether gives us Eq.(26) from the main text:
Note that we can also solve for
and for :
However, we still have to take the limit as . Noting that the magnitude of goes to 0 as , it means that also goes to 0 in magnitude. Plugging in the above equations gives the final results:
Comments
There are no comments yet.