Research in deep learning (see Bengio (2009) and Bengio et al. (2013a) for reviews) grew from breakthroughs in unsupervised learning of representations, based mostly on the Restricted Boltzmann Machine (RBM) (Hinton et al., 2006), auto-encoder variants (Bengio et al., 2007; Vincent et al., 2008), and sparse coding variants (Lee et al., 2007; Ranzato et al., 2007). However, the most impressive recent results have been obtained with purely supervised learning techniques for deep networks, in particular for speech recognition (Dahl et al., 2010; Deng et al., 2010; Seide et al., 2011) and object recognition (Krizhevsky et al., 2012). The latest breakthrough in object recognition (Krizhevsky et al., 2012) was achieved with fairly deep convolutional networks with a form of noise injection in the input and hidden layers during training, called dropout (Hinton et al., 2012).
In all of these cases, the availability of large quantities of labeled data was critical.
On the other hand, progress with deep unsupervised architectures has been slower, with the established approaches with a probabilistic footing being the Deep Belief Network (DBN) (Hinton et al., 2006) and the Deep Boltzmann Machine (DBM) (Salakhutdinov and Hinton, 2009). Although single-layer unsupervised learners are fairly well developed and used to pre-train these deep models, jointly training all the layers with respect to a single unsupervised criterion remains a challenge, with a few techniques arising to reduce that difficulty (Montavon and Müller, 2012; Goodfellow et al., 2013). In contrast to recent progress toward joint supervised training of models with many layers, joint unsupervised training of deep models remains a difficult task.
In particular, the normalization constant involved in complex multimodal probabilistic models is often intractable. This is dealt with using various approximations (discussed below), whose limitations may be an important part of the difficulty of training and using deep unsupervised, semi-supervised, or structured output models.
Though the goal of training large unsupervised networks has turned out to be more elusive than its supervised counterpart, the vastly larger available volume of unlabeled data still beckons for efficient methods to model it. Recent progress in training supervised models raises the question: can we take advantage of this progress to improve our ability to train deep, generative, unsupervised, semi-supervised or structured output models?
This paper lays theoretical foundations for a move in this direction through the following main contributions:
1 – Intuition: In Section 2 we discuss what we view as basic motivation for studying alternate ways of training unsupervised probabilistic models, i.e., avoiding the intractable sums or maximization involved in many approaches.
2 – Training Framework: We start Section 3 by presenting our recent work on the generative view of denoising auto-encoders (Section 3.1). We present the walkback algorithm which addresses some of the training difficulties with denoising auto-encoders (Section 3.2).
We then generalize those results by introducing latent variables in the framework to define Generative Stochastic Networks (GSNs) (Section 3.4). GSNs aim to estimate the data-generating distribution indirectly, by parametrizing the transition operator of a Markov chain rather than directly parametrizing a model
of the observed random variable. Most critically, this framework transforms the unsupervised density estimation problem into one which is more similar to supervised function approximation. This enables training by (possibly regularized) maximum likelihood and gradient descent computed via simple back-propagation, avoiding the need to compute intractable partition functions. Depending on the model, this may allow us to draw from any number of recently demonstrated supervised training tricks. For example, one could use a convolutional architecture with max-pooling for parametric parsimony and computational efficiency, or dropout (Hinton et al., 2012) to prevent co-adaptation of hidden representations.
3 – General theory: Training the generative (decoding / denoising) component of a GSN, P(X | h), with noisy representation h is often far easier than modeling P(X) explicitly (compare the blue and red distributions in Figure 1). We prove that if our estimated P(X | h) is consistent (e.g., through maximum likelihood), then the stationary distribution of the resulting Markov chain is a consistent estimator of the data-generating density, P(X) (Section 3.1 and Appendix 6).
4 – Consequences of theory: We show that the model is general and extends to a wide range of architectures, including sampling procedures whose computation can be unrolled as a Markov Chain, i.e., architectures that add noise during intermediate computation in order to produce random samples of a desired distribution (Theorem 3.4.1). An exciting frontier in machine learning is the problem of modeling so-called structured outputs, i.e., modeling a conditional distribution P(Y | X) where the output Y is high-dimensional and has a complex multimodal joint distribution (given the input variable X). We show how GSNs can be used to support such structured output and missing values (Section 3.6).
5 – Example application: In Section 4.2 we show an example application of the GSN theory to create a deep GSN whose computational graph resembles the one followed by Gibbs sampling in deep Boltzmann machines (with continuous latent variables), but that can be trained efficiently with back-propagated gradients and without layerwise pretraining. Because the Markov Chain is defined over a state that includes latent variables, we reap the dual advantage of more powerful models for a given number of parameters and better mixing in the chain as we add noise to variables representing higher-level information, first suggested by the results obtained by Bengio et al. (2013b) and Luo et al. (2013). The experimental results show that such a model with latent states indeed mixes better than shallower models without them (Table 1).
6 – Dependency networks: Finally, an unexpected result falls out of the GSN theory: it allows us to provide a novel justification for dependency networks (Heckerman et al., 2000) and for the first time define a proper joint distribution between all the visible variables that is learned by such models (Section 3.8).
2 Summing over too many major modes
The approach presented in this paper is motivated by a difficulty often encountered with probabilistic models, especially those containing anonymous latent variables. They are called anonymous because no a priori semantics are assigned to them, like in Boltzmann machines, and unlike in many knowledge-based graphical models. Whereas inference over non-anonymous latent variables is required to make sense of the model, anonymous variables are only a device to capture the structure of the distribution and need not have a clear human-readable meaning.
Graphical models with latent variables often require dealing with either or both of the following fundamentally difficult problems in the inner loop of training, or to actually use the model for making decisions: inference (estimating the posterior distribution over latent variables h given inputs x) and sampling (from the joint model of h and x). However, if the posterior has a huge number of modes that matter, then the approximations made may break down.
Many of the computations involved in graphical models (inference, sampling, and learning) are made intractable and difficult to approximate because of the large number of non-negligible modes in the modeled distribution (either directly P(x) or a joint distribution P(x, h) involving latent variables h). In all of these cases, what is intractable is the computation or approximation of a sum (often weighted by probabilities), such as a marginalization or the estimation of the gradient of the normalization constant. If only a few terms in this sum dominate (corresponding to the dominant modes of the distribution), then many good approximate methods can be found, such as Markov chain Monte Carlo (MCMC) methods.
Deep Boltzmann machines (Salakhutdinov and Hinton, 2009) combine the difficulty of inference (for the positive phase, where one tries to push the energies associated with the observed x down) and also that of sampling (for the negative phase, where one tries to push up the energies associated with x's sampled from the model P_θ(x)). Sampling for the negative phase is usually done by MCMC, although some unsupervised learning algorithms (Collobert and Weston, 2008; Gutmann and Hyvarinen, 2010; Bordes et al., 2013) involve “negative examples” that are sampled through simpler procedures (like perturbations of the observed input, in a spirit reminiscent of the approach presented here). Unfortunately, using an MCMC method to sample from P_θ(x) in order to estimate the gradient of the partition function may be seriously hurt by the presence of a large number of important modes, as argued below.
To evade the problem of highly multimodal joint or posterior distributions, the currently known approaches to dealing with the above intractable sums make very strong explicit assumptions (in the parametrization) or implicit assumptions (by the choice of approximation methods) on the form of the distribution of interest. In particular, MCMC methods are more likely to produce a good estimator if the number of non-negligible modes is small: otherwise the chains would require at least as many MCMC steps as the number of such important modes, times a factor that accounts for the mixing time between modes. Mixing time itself can be very problematic as a trained model becomes sharper, as it approaches a data-generating distribution that may have well-separated and sharp modes (i.e., manifolds) (Bengio et al., 2013b).
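To make the mixing difficulty concrete, here is a small illustrative sketch (not from the paper; the toy density, names, and parameters are our own): a random-walk Metropolis sampler applied to a two-mode density whose modes are separated by many times the proposal step size stays trapped in the mode where it started.

```python
import numpy as np

def rw_metropolis(logp, x0, n_steps, step_size, rng):
    """Random-walk Metropolis sampler for a 1-D (unnormalized) log-density."""
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        prop = x + step_size * rng.standard_normal()
        # Accept with probability min(1, p(prop)/p(x))
        if np.log(rng.uniform()) < logp(prop) - logp(x):
            x = prop
        samples[t] = x
    return samples

def logp(x, sep=20.0):
    """Equal mixture of two unit Gaussians with well-separated modes."""
    return np.logaddexp(-0.5 * x**2, -0.5 * (x - sep)**2)

rng = np.random.default_rng(0)
chain = rw_metropolis(logp, x0=0.0, n_steps=5000, step_size=1.0, rng=rng)
# With local proposals, the chain essentially never visits the second mode,
# so the estimated mass of that mode is badly wrong (should be 0.5):
frac_right_mode = np.mean(chain > 10.0)
```

The failure is exactly the one described above: the number of MCMC steps needed grows with the number of important modes times the mixing time between them.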
We propose to make another assumption that might suffice to bypass this multimodality problem: the effectiveness of function approximation. As is typical in machine learning, we postulate a rather large and flexible family of functions (such as deep neural nets) and then use all manner of tricks to pick a member from that combinatorially large family (i.e. to train the neural net) that both fits observed data and generalizes to unseen data well.
In particular, the GSN approach presented in the next section relies on estimating the transition operator of a Markov chain, e.g., P(x_t | x_{t-1}) or P(x_t, h_t | x_{t-1}, h_{t-1}). Because each step of the Markov chain is generally local, these transition distributions will often include only a very small number of important modes (those in the neighborhood of the previous state). Hence the gradient of their partition function will be easy to approximate. For example, consider the denoising transitions studied by Bengio et al. (2013c) and illustrated in Figure 1, where x̃ is a stochastically corrupted version of x and we learn the denoising distribution P(x | x̃). In the extreme case (studied empirically here) where P(x | x̃) is approximated by a unimodal distribution, the only form of training that is required involves function approximation (predicting the clean x from the corrupted x̃).
Although having the true P(x | x̃) turn out to be unimodal makes it easier to find an appropriate family of models for it, unimodality is by no means required by the GSN framework itself. One may construct a GSN using any multimodal model for the output P(x | x̃) (e.g., mixture of Gaussians, RBMs, NADE, etc.), provided that gradients for the parameters of the model in question can be estimated (e.g., log-likelihood gradients).
The approach proposed here thus avoids the need for a poor approximation of the gradient of the partition function in the inner loop of training, but still has the potential of capturing very rich distributions by relying mostly on “function approximation”.
Besides the approach discussed here, there may well be other very different ways of evading this problem of intractable marginalization, including approaches such as sum-product networks (Poon and Domingos, 2011), which are based on learning a probability function that has a tractable form by construction and yet is from a flexible enough family of distributions. Another interesting direction of investigation that avoids the need for MCMC and intractable partition functions is the variational auto-encoder (Kingma and Welling, 2014; Gregor et al., 2014; Mnih and Gregor, 2014; Rezende et al., 2014) and related directed models (Bornschein and Bengio, 2014; Ozair and Bengio, 2014), which rely on learned approximate inference.
3 Generative Stochastic Networks
In this section we work our way from denoising auto-encoders (DAE) to generative stochastic networks (GSN). We illustrate the usefulness of denoising auto-encoders being applied iteratively as a way to generate samples (and model a distribution). We introduce the walkback training algorithm and show how it can facilitate the training.
We generalize the theory to GSNs, and provide a theorem that serves as a recipe for how they can be trained. We also reference a classic result from matrix perturbation theory to analyze the behavior of GSNs in terms of their stationary distribution.
We then study how GSNs may be used to fill missing values and theoretical conditions for estimating associated conditional samples. Finally, we connect GSNs to dependency nets and show how the GSN framework fixes one of the main problems with the theoretical analysis of dependency nets and propose a particular way of sampling from them.
3.1 Denoising auto-encoders to model probability distributions
Assume the problem we face is to construct a model for some unknown data-generating distribution P(X) given only examples of X drawn from that distribution. In many cases, the unknown distribution P(X) is complicated, and modeling it directly can be difficult.
A recently proposed approach using denoising auto-encoders (DAE) transforms the difficult task of modeling P(X) into a supervised learning problem that may be much easier to solve. The basic approach is as follows: given a clean example data point X from P(X), we obtain a corrupted version X̃ by sampling from some corruption distribution C(X̃ | X). For example, we might take a clean image, X, and add random white noise to produce X̃. We then use supervised learning methods to train a function to reconstruct, as accurately as possible, any X from the data set given only a noisy version X̃. As shown in Figure 1, the reconstruction distribution P(X | X̃) may often be much easier to learn than the data distribution P(X), because P(X | X̃) tends to be dominated by a single or few major modes (such as the roughly Gaussian shaped density in the figure). What we call a major mode is one that is surrounded by a substantial amount of probability mass. There may be a large number of minor modes that can be safely ignored in the context of approximating a distribution, but the major modes should not be missed.
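As an illustrative sketch of this supervised view (not from the paper; the toy data, noise level, and linear denoiser are our own assumptions), one can corrupt samples lying near a low-dimensional manifold with Gaussian noise and fit a least-squares reconstruction function:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a 1-D manifold (a line) embedded in 2-D, standing in for P(X).
t = rng.uniform(-1.0, 1.0, size=(500, 1))
X = np.hstack([t, 2.0 * t])                    # clean samples from P(X)

sigma = 0.5
X_tilde = X + sigma * rng.standard_normal(X.shape)  # C(X_tilde | X): Gaussian corruption

# Supervised "reconstruction function": least-squares affine map X_tilde -> X.
A = np.hstack([X_tilde, np.ones((len(X), 1))])      # add a bias column
W, *_ = np.linalg.lstsq(A, X, rcond=None)
X_hat = A @ W

noise_mse = np.mean((X_tilde - X) ** 2)   # error left by the corruption alone
recon_mse = np.mean((X_hat - X) ** 2)     # error after supervised denoising
```

Even this crude linear denoiser pulls corrupted points back toward the data manifold, which is the core of the supervised reformulation; a real DAE would use a neural network and output a full reconstruction distribution.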
But how does learning the reconstruction distribution P(X | X̃) help us solve our original problem of modeling P(X)? The two problems are clearly related, because if we knew everything about P(X), then our knowledge of the C(X̃ | X) that we chose would allow us to precisely specify the optimal reconstruction function via Bayes' rule: P(X | X̃) = C(X̃ | X) P(X) / z, where z is a normalizing constant that does not depend on X. As one might hope, the relation is also true in the opposite direction: once we pick a method of adding noise, C(X̃ | X), knowledge of the corresponding reconstruction distribution P(X | X̃) is sufficient to recover the density of the data, P(X).
In a recent paper, Alain and Bengio (2013) showed that denoising auto-encoders with small Gaussian corruption and squared error loss estimated the score (derivative of the log-density with respect to the input) of continuous observed random variables, thus implicitly estimating P(X). The following Proposition 3.1 generalizes this to arbitrary variables X (discrete, continuous or both), arbitrary corruption C(X̃ | X) (not necessarily asymptotically small), and an arbitrary loss function (so long as it can be seen as a log-likelihood).
Proposition 3.1. Let P(X) be the training distribution for which we only have empirical samples. Let C(X̃ | X) be the fixed corruption distribution and P_θ(X | X̃) be the trained reconstruction distribution (assumed to have sufficient capacity). We define a Markov chain that starts at some X_0 ~ P(X) and then iteratively samples pairs of values (X_t, X̃_t) by alternately sampling from C(X̃ | X = X_t) and from P_θ(X | X̃ = X̃_t).

Let π be the stationary distribution of this Markov chain when we consider only the sequence of values of X.

If we assume that this Markov chain is irreducible, that its stationary distribution exists, and if we assume that P_θ(X | X̃) is the distribution that minimizes optimally the following expected loss

L = − ∫_x ∫_x̃ P(x) C(x̃ | x) log P_θ(x | x̃) dx̃ dx,

then we have that the stationary distribution π is the same as the training distribution P(X).

Proof. If we look at the density P(X̃ = x̃) = ∫ C(x̃ | x) P(x) dx that we get for X̃ by applying C to the training data from P(X), we can rewrite the loss as a KL divergence

L = E_{x̃ ~ P(X̃)} [ KL( P(X | X̃ = x̃) ‖ P_θ(X | X̃ = x̃) ) ] + const,

where the constant is independent of P_θ. This expression is minimized when we have a P_θ that satisfies

P_θ(X | X̃) = P(X | X̃).

In that case, we have that

P_θ(X | X̃) P(X̃) = P(X | X̃) P(X̃) = C(X̃ | X) P(X),     (1)

where P(X | X̃) represents the true conditional that we get through the usual application of Bayes' rule.

Now, when we sample iteratively between C(X̃ | X) and P_θ(X | X̃) to get the Markov chain illustrated above, we are performing Gibbs sampling. We understand what Gibbs sampling does, and here we are sampling using the two possible ways of expressing the joint from equation (1). This means that the stationary distribution π of the Markov chain will have P(X) as marginal density when we look only at the X component of the chain.
Beyond proving that P_θ(X | X̃) is sufficient to reconstruct the data density, Proposition 3.1 also demonstrates a method of sampling from a learned, parametrized model of the density, P_θ(X), by running a Markov chain that alternately adds noise using C(X̃ | X) and denoises by sampling from the learned P_θ(X | X̃), which is trained to approximate the true P(X | X̃).
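A minimal sketch of this sampling procedure (our own toy setup, not from the paper): we take the idealized case where P(X) is a standard Gaussian, so the optimal reconstruction distribution P(X | X̃) is available in closed form and can stand in for a perfectly trained P_θ. Alternating corruption and reconstruction then yields a chain whose stationary distribution recovers P(X).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                            # corruption noise level in C(x_tilde | x)
gain = 1.0 / (1.0 + sigma**2)          # posterior mean coefficient when P(X) = N(0, 1)
post_std = np.sqrt(sigma**2 * gain)    # std of the true P(x | x_tilde)

x = 5.0                                # arbitrary starting point X_0, far from the data
samples = []
for _ in range(20000):
    x_tilde = x + sigma * rng.standard_normal()            # sample C(x_tilde | x)
    x = gain * x_tilde + post_std * rng.standard_normal()  # sample P(x | x_tilde)
    samples.append(x)
samples = np.array(samples[1000:])     # drop burn-in steps
```

Despite starting at x = 5, the chain settles into the stationary distribution, whose mean and standard deviation match the data distribution N(0, 1).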
Before moving on, we should pause to make an important point clear. Alert readers may have noticed that P(X | X̃) and P(X) can each be used to reconstruct the other given knowledge of C(X̃ | X). Further, if we assume that we have chosen a simple C(X̃ | X) (say, a uniform Gaussian with a single width parameter), then P(X | X̃) and P(X) must both be of approximately the same complexity. Put another way, we can never hope to combine a simple C(X̃ | X) and a simple P(X | X̃) to model a complex P(X). Nonetheless, it may still be the case that P(X | X̃) is easier to model than P(X), due to reduced computational complexity in computing or approximating the partition functions of the conditional distribution mapping corrupted input X̃ to the distribution of corresponding clean input X. Indeed, because that conditional is going to be mostly assigning probability to X locally around X̃, P(X | X̃) has only one or a few major modes, while P(X) can have a very large number of them.
So where did the complexity go? P(X | X̃) has fewer major modes than P(X), but the location of these modes depends on the value of X̃. It is precisely this mapping from X̃ to mode location that allows us to trade a difficult density modeling problem for a supervised function approximation problem that admits application of many of the usual supervised learning tricks.
In the Gaussian noise example, what happens is that the tails of the Gaussian are exponentially damping all but the modes that are near x, thus preserving the actual number of modes but considerably changing the number of major modes. In the Appendix we also present one alternative line of reasoning based on a corruption process C(X̃ | X) that has finite local support, thus completely removing the modes that are not in the neighborhood of x̃. We argue that even with such a corruption process, the stationary distribution π will match the original P(X), so long as one can still visit all the regions of interest through a sequence of such local jumps.
Two potential issues with Proposition 3.1 are that 1) we are learning the distribution P_θ(X | X̃) based on empirical samples, so it is only asymptotically minimizing the desired loss, and 2) we may not have enough capacity in our model to estimate P(X | X̃) perfectly. The issue is that, when running a Markov chain for infinitely long using a slightly imperfect P_θ(X | X̃), these small differences may affect the stationary distribution and compound over time. We are not allowed to “adjust” the parameters of P_θ(X | X̃) as the chain runs.
3.2 Walkback algorithm for training denoising auto-encoders
In this section we describe the walkback algorithm, which is very similar to the method from Proposition 3.1 but helps training to converge faster. It differs in the training samples that are used, and in the fact that the solution is obtained through an iterative process: the parameter update changes the corruption function, which changes the X̃ in the training samples, which influences the next parameter update, and so on.
Sampling in high-dimensional spaces (like in experiments in Section 4.1) using a simple local corruption process (such as Gaussian or salt-and-pepper noise) suggests that if the corruption is too local, the DAE’s behavior far from the training examples can create spurious modes in the regions insufficiently visited during training. More training iterations or increasing the amount of corruption noise helps to substantially alleviate that problem, but we discovered an even bigger boost by training the Markov chain to walk back towards the training examples (see Figure 2).
We exploit knowledge of the currently learned model P_θ(X | X̃) to define the corruption, so as to pick values of X̃ that would be obtained by following the generative chain: wherever the model would go if we sampled using the generative Markov chain starting at a training example X, we consider those points to be a kind of “negative example” X̃ from which the auto-encoder should move away (and towards X). The spirit of this procedure is thus very similar to the CD-k (Contrastive Divergence with k MCMC steps) procedure proposed to train RBMs (Hinton, 1999; Hinton et al., 2006).
We start by defining the modified corruption process C_k(X̃ | X) that samples k times, alternating between C(X̃ | X) and the current P_θ(X | X̃). We can express this recursively if we let C_1(X̃ | X) be our original C(X̃ | X), and then define

C_{k+1}(X̃ | X) = E_{X̃′ ~ C_k(·|X)} E_{X′ ~ P_θ(·|X̃′)} [ C(X̃ | X′) ].

Note that this corruption distribution now involves the distribution P_θ that we are learning.
With the help of the above definition of C_k, we define the walkback corruption process C_wb(X̃ | X). To sample from C_wb, we first draw a k distributed according to some distribution, e.g., a geometric distribution with parameter p and support on {1, 2, …}, and then we sample X̃ according to the corresponding C_k(X̃ | X). Other distributions for k could be used, but we just want something convenient for that hyperparameter. Conceptually, the corruption process C_wb means that, from a starting point X, we apply iteratively the original C and P_θ, and then we flip a coin to determine if we want to do it again. We re-apply until we lose the coin flip, and then this gives us a final value for the sample X̃ based on X.
The walkback loss is given by

L_wb = − E_{X ~ P(X)} E_{X̃ ~ C_wb(X̃|X)} [ log P_θ(X | X̃) ],

for samples drawn from P(X) and C_wb(X̃ | X). Minimizing this loss is an iterative process, because the samples used in the empirical expression depend on the parameter θ to be learned. This iterated minimization is what we call the walkback algorithm. Samples are generated with the current parameter value θ_t, and then the parameters are modified to reduce the loss and yield θ_{t+1}. We repeat until the process stabilizes. In practical applications, we do not have infinite-capacity models and we do not have a guarantee that the walkback algorithm should converge to some θ∗.
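The walkback corruption and the resulting training pairs can be sketched as follows (a toy stand-in, not the paper's implementation: `denoise_sample` plays the role of the current learned P_θ, which in practice would be a trained network, and the noise levels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

def corrupt(x, sigma=0.3):
    """Original local corruption C(x_tilde | x): additive Gaussian noise."""
    return x + sigma * rng.standard_normal()

def denoise_sample(x_tilde, gain=0.9):
    """Stand-in for sampling the current learned P_theta(x | x_tilde).
    A real model would be a trained network; here, a toy stochastic shrinkage."""
    return gain * x_tilde + 0.1 * rng.standard_normal()

def walkback_corrupt(x, p=0.5):
    """Sample from the walkback corruption C_wb(x_tilde | x): alternate C and
    P_theta, flipping a coin after each round (geometric number of rounds)."""
    x_tilde = corrupt(x)
    while rng.uniform() < p:
        x_tilde = corrupt(denoise_sample(x_tilde))
    return x_tilde

x0 = 1.0
pairs = [(x0, walkback_corrupt(x0)) for _ in range(1000)]
targets, corrupted = zip(*pairs)
# Each (x0, x_tilde) pair is one training example: the model is asked to
# "walk back" from x_tilde to the clean x0, however many steps away it wandered.
```

Note how the corruption process itself depends on the current model, which is exactly why the loss must be minimized iteratively.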
3.2.1 Reparametrization Trick
Note that we do not need to analytically marginalize over the latent variables involved: we can back-propagate through the chain, considering it like a recurrent neural network with noise (the corruption) injected in it. This is an instance of the so-called reparametrization trick, already proposed in Bengio (2013), Kingma (2013), and Kingma and Welling (2014). The idea is that we can consider sampling from a random variable conditionally on others (such as X̃ given X) as equivalent to applying a deterministic function taking as arguments the conditioning variables as well as some i.i.d. noise sources. This view is particularly useful for the more general GSNs introduced later, in which we typically choose the latent variables to be continuous, i.e., allowing us to backprop through their sampling steps when exploiting the reparametrization trick.
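A minimal numeric sketch of the reparametrization trick (the objective and constants are our own, for illustration only): a Gaussian sample is rewritten as a deterministic function of its parameters and i.i.d. noise, so a pathwise gradient estimate can be averaged over noise draws.

```python
import numpy as np

rng = np.random.default_rng(4)

# Reparametrization: a sample from N(mu, sigma^2) is written as the deterministic
# function mu + sigma * eps with eps ~ N(0, 1); gradients with respect to mu
# (and sigma) then pass straight through the sampling step.
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(100000)
z = mu + sigma * eps              # samples, differentiable in mu and sigma

# Toy objective f(z) = z^2, so E[f] = mu^2 + sigma^2 and dE/dmu = 2*mu.
# Pathwise estimate: average of df/dz * dz/dmu = 2*z * 1 over noise draws.
grad_est = np.mean(2.0 * z)
```

The same mechanics are what allows back-propagating through a chain of continuous latent variables: each stochastic step becomes a deterministic function of its inputs plus injected noise.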
3.2.2 Equivalence of the Walkback Procedure
With the walkback algorithm, one can also decide whether or not to include in the loss function all the intermediate reconstruction distributions through which the trajectories pass. That is, starting from some X_0, we sample

X̃_1 ~ C(X̃ | X_0),  X_1 ~ P_θ(X | X̃_1),  X̃_2 ~ C(X̃ | X_1),  X_2 ~ P_θ(X | X̃_2),  …

and we use all the pairs (X_0, X̃_k) as training data for the walkback loss.
The following proposition looks very similar to Proposition 3.1, but it uses the walkback corruption C_wb(X̃ | X) instead of the original corruption C(X̃ | X). It is also an iterated process through which the current value of the parameter θ_t sets the loss function that will be minimized by the updated θ_{t+1}.
Let P(X) be the training distribution for which we only have empirical samples. Let π(X) be the implicitly defined asymptotic distribution of the Markov chain alternating sampling from P_θ(X | X̃) and C(X̃ | X), where C is the original local corruption process.

If we assume that P_θ(X | X̃) has sufficient capacity and that the walkback algorithm converges (in terms of being stable in the updates to P_θ), then π(X) = P(X).

That is, the Markov chain defined by alternating C(X̃ | X) and P_θ(X | X̃) gives us samples that are drawn from the same distribution as the training data.
Proof. Consider that during training, we produce a sequence of estimators P_θ_t, where θ_t corresponds to the t-th training iteration (modifying the parameters after each iteration). With the walkback algorithm, P_θ_t is used to obtain the corrupted samples X̃ from which the next model P_θ_{t+1} is produced.

If training converges in terms of θ_t → θ∗, it means that we have found a value of θ∗ such that P_θ∗ minimizes the loss

− E_{X ~ P(X)} E_{X̃ ~ C_wb(X̃|X)} [ log P_θ∗(X | X̃) ]

for samples drawn from P(X) and the walkback corruption C_wb(X̃ | X) defined using θ∗ itself.

By Proposition 3.1, we know that, regardless of the corruption used, when we have a P_θ that minimizes optimally the loss

− E_{X ~ P(X)} E_{X̃ ~ C_wb(X̃|X)} [ log P_θ(X | X̃) ],

then we can recover P(X) by alternating between C_wb(X̃ | X) and P_θ(X | X̃).

Therefore, once the model is trained with walkback, the stationary distribution of the Markov chain that it creates has the same distribution as the training data.

Hence if we alternate between the original corruption C(X̃ | X) and the walkback solution P_θ∗(X | X̃), then the stationary distribution with respect to X is also P(X).
Note that this proposition applies regardless of the parameter of the geometric distribution used to determine how many steps of corruption will be applied. It applies whether we keep all the samples along the way, or only the one at the last step. It applies regardless of whether we use a geometric distribution to determine which C_k to select, or any other type of distribution.
A consequence is that the walkback training algorithm estimates the same distribution as the original denoising algorithm, but may do it more efficiently (as we observe in the experiments), by exploring the space of corruptions in a way that spends more time where it most helps the model to kill off spurious modes.
The Markov chain that we get with walkback should also generally mix faster and be less susceptible to getting stuck in bad modes, but it will require a P_θ(X | X̃) with more capacity than originally. This is because C_wb(X̃ | X) is now less local, covering the values of the initial X that could have given rise to the X̃ resulting from several steps of the Markov chain.
3.3 Walkbacks with individual scaling factors to handle uncertainty
The use of the proposed walkback training procedure is effective in suppressing the spurious modes in the learned data distribution. Although the convergence is guaranteed asymptotically, in practice, given limited model capacity and training data, it has been observed that the more walkbacks in training, the more difficult it is to maximize P_θ(X | X̃). This is simply because more and more noise is added in this procedure, resulting in an X̃ that is further away from X, and therefore a potentially more complicated reconstruction distribution.

In other words, P_θ(X | X̃) needs to have the capacity to model increasingly complex reconstruction distributions. As a result of training, a simple, or usually unimodal, P_θ(X | X̃) is most likely to learn a distribution with larger uncertainty than the one learned without walkbacks, in order to distribute some probability mass to the more complicated and multimodal distributions implied by the walkback training procedure. One possible solution to this problem is to use a multimodal reconstruction distribution such as in Ozair et al. (2014), Larochelle and Murray (2011), or Dinh et al. (2015). We propose here another solution, which can be combined with the above, that consists in allowing a different level of entropy for different steps of the walkback.
3.3.1 Scaling trick in binary
In the case of binary X, the most common choice of the reconstruction distribution is the factorized Multinoulli distribution P_θ(X | X̃) = ∏_{i=1}^D P_θ(X_i | X̃), where D is the dimensionality of X. Each factor P_θ(X_i | X̃) is modeled by a Bernoulli distribution that has its parameter p_i = sigmoid(f_i(X̃)), where f is a general nonlinear transformation realized by a neural network. We propose to use a different scaling factor α_k for different walkback steps, resulting in a new parameterization p_i = sigmoid(α_k f_i(X̃)) for the k-th walkback step, with α_k > 0 being learned.

α_k effectively scales the pre-activation of the sigmoid function according to the uncertainty or entropy associated with different walkback steps. Naturally, later reconstructions in the walkback sequence are less accurate because more noise has been injected. Hence, given the k-th and l-th walkback steps that satisfy k < l, the learning will tend to result in α_k > α_l, because a larger α corresponds to less entropy.
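A small numeric sketch of this scaling trick (the pre-activation value and the α_k values below are hypothetical, standing in for quantities a network would compute and learn): scaling the sigmoid's pre-activation directly controls the entropy of the resulting Bernoulli factor.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log1p(-p))

# Hypothetical pre-activation f_i(x_tilde) from the network, held fixed,
# and hypothetical learned alpha_k for walkback steps k = 1, 2, 3.
a = 1.2
alphas = {1: 2.0, 2: 1.0, 3: 0.5}   # earlier steps sharper, later steps flatter
entropies = {k: bernoulli_entropy(sigmoid(alpha * a)) for k, alpha in alphas.items()}
# Larger alpha_k sharpens the Bernoulli (lower entropy); smaller alpha_k
# flattens it toward p = 0.5, absorbing the extra noise of later walkback steps.
```

This matches the ordering described above: for steps k < l, a larger α_k yields a lower-entropy (sharper) reconstruction than α_l.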
3.3.2 Scaling trick in real-valued
In the case of real-valued X, the most common choice of P_θ(X | X̃) is the factorized Gaussian. In particular, each factor P_θ(X_i | X̃) is modeled by a Normal distribution with its parameters μ_i and σ_i. Using the same idea of learning separate scaling factors, we can parametrize the k-th walkback step as N(μ_i, α_k σ_i²), where α_k is positive and also learned. However, given the k-th and l-th walkback steps that satisfy k < l, the learning will result in α_k < α_l, since in this case a larger α indicates larger entropy.
3.3.3 Sampling with the learned scaling factors
After learning the scaling factors α_k for different walkback steps, the sampling is straightforward. One noticeable difference is that we have learned several Markov transition operators, one per walkback step. Although asymptotically all these Markov chains generate the same distribution of X, in practice they result in different distributions because of the different α_k learned. In fact, using α_1 results in samples that are sharper and more faithful to the data distribution. We verify the effect of learning the scaling factors further in the experimental section.
3.4 Extending the denoising auto-encoder to more general GSNs
The denoising auto-encoder Markov chain is defined by X̃_t ~ C(X̃ | X_t) and X_{t+1} ~ P_θ(X | X̃_t), where X_t alone can serve as the state of the chain. The GSN framework generalizes the DAE in two ways:

the “corruption” function is not fixed anymore but a parametrized function that can be learned and corresponds to a “hidden” state H (so we write the output of this function H rather than X̃); and

that intermediate variable H is now considered part of the state of the Markov chain, i.e., its value H_{t+1} at step t+1 of the chain depends not just on the previous visible X_t but also on the previous state H_t.
For this purpose, we define the Markov chain associated with a GSN in terms of a visible variable X_t and a latent variable H_t as state variables, of the form

H_{t+1} ~ P_θ1(H | H_t, X_t)
X_{t+1} ~ P_θ2(X | H_{t+1}).

This definition makes denoising auto-encoders a special case of GSNs. Note that, given that the distribution of H_{t+1} may depend on a previous value of H_t, we find ourselves with an extra H_0 variable added at the beginning of the chain. This complicates things when it comes to training, but when we are in a sampling regime we can simply wait a sufficient number of steps to burn in.
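The GSN chain above can be sketched with toy linear-Gaussian conditionals standing in for P_θ1 and P_θ2 (all coefficients below are our own, chosen only so that the chain is stable; a real GSN would use trained networks for both conditionals):

```python
import numpy as np

rng = np.random.default_rng(5)

def f_theta1(h, x):
    """Sample H_{t+1} ~ P_theta1(H | H_t = h, X_t = x): toy stochastic encoder.
    Note the latent update depends on BOTH the previous latent and visible."""
    return 0.4 * h + 0.4 * x + 0.3 * rng.standard_normal()

def f_theta2(h):
    """Sample X_{t+1} ~ P_theta2(X | H_{t+1} = h): toy stochastic decoder.
    The visible sample depends only on the new latent state."""
    return h + 0.3 * rng.standard_normal()

h, x = 0.0, 0.0        # (H_0, X_0): the extra initial latent state
xs = []
for _ in range(10000):
    h = f_theta1(h, x)  # H_{t+1} ~ P_theta1(H | H_t, X_t)
    x = f_theta2(h)     # X_{t+1} ~ P_theta2(X | H_{t+1})
    xs.append(x)
xs = np.array(xs[500:])  # discard burn-in, as described above
```

Because the state is the pair (H_t, X_t) rather than X_t alone, the chain carries memory in the latent variable between visible samples, which is exactly the second generalization over the DAE chain.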
3.4.1 Main result about GSNs
The next theoretical results give conditions for making the stationary distributions of the above Markov chain match a target data-generating distribution. It basically says that, in order to estimate the data-generating distribution , it is enough to achieve two conditions.
The first condition is similar to the one we obtain when minimizing denoising reconstruction error, i.e., we must make sure that the reconstruction distribution approaches the conditional distribution , i.e., the ’s that could have given rise to .
The second condition is novel and regards the initial state of the chain, which influences . It says that must match . One way to achieve that is to initialize associated with a training example with the previous value of that was sampled when example was processed. In the graphical model in the statement of Theorem 3.4.1, note how the arc relating and goes in the direction, which is different from the way we would sample from the GSN (graphical model above), where we have . Indeed, during training, is given, forcing it to have the data-generating distribution.
Note that Theorem 3.4.1 is there to provide us with a guarantee about what happens when those two conditions are satisfied. It is not originally meant to describe a training method.
In section 3.4.3 we explain how these conditions could be approximately achieved.
Theorem 3.4.1. Let $\{(X_t, H_t)\}_{t=0}^{\infty}$ be the Markov chain defined by the following graphical model.
If we assume that the chain has a stationary distribution $\pi_{X,H}$, and that for every value of $(x, h)$ we have that
all the $P(X_t = x \mid H_t = h) = g(x \mid h)$ share the same density for $t \geq 1$,
all the $P(H_{t+1} = h \mid H_t = h', X_t = x) = f(h \mid h', x)$ share the same density for $t \geq 0$,
$P(H_0 = h \mid X_0 = x) = P(H_1 = h \mid X_1 = x)$,
then for every value of $(x, h)$ we get that
$P(X_0 = x \mid H_0 = h) = g(x \mid h)$ holds, which is something that was assumed only for $t \geq 1$,
$P(X_t = x, H_t = h) = P(X_0 = x, H_0 = h)$ for all $t \geq 0$,
the stationary distribution $\pi_{X,H}$ has a marginal distribution $\pi_X$ such that $\pi(x) = P(X_0 = x)$.
Those conclusions show that our Markov chain has the property that its samples in $X$ are drawn from the same distribution as $X_0$.
The proof hinges on a few manipulations done with the first variables to show that $P(X_t = x \mid H_t = h) = g(x \mid h)$, which is assumed for $t \geq 1$, also holds for $t = 0$.
For all $t \geq 1$ we have that
$$P(X_t = x \mid H_t = h) = g(x \mid h).$$
The equality in distribution between $X_1$ and $X_0$ is obtained with
$$P(X_1 = x) = \int P(X_1 = x \mid H_1 = h)\, P(H_1 = h)\, dh = \int P(X_0 = x \mid H_0 = h)\, P(H_0 = h)\, dh = P(X_0 = x).$$
Then we can use this to conclude that
$$P(X_0 = x \mid H_0 = h) = g(x \mid h),$$
so, despite the arrow in the graphical model being turned the other way, we have that the density of $P(X_0 = x \mid H_0 = h)$ is the same $g(x \mid h)$ as for all the other $P(X_t = x \mid H_t = h)$ with $t \geq 1$.
Now, since the distribution of $(X_1, H_1)$ is the same as the distribution of $(X_0, H_0)$, and the transition probability is entirely defined by the $f$ and $g$ densities which are found at every step for all $t \geq 0$, then we know that $(X_2, H_2)$ will have the same distribution as $(X_1, H_1)$. To make this point more explicitly,
$$P(X_2 = x, H_2 = h) = \int\!\!\int g(x \mid h)\, f(h \mid h', x')\, P(X_1 = x', H_1 = h')\, dx'\, dh' = \int\!\!\int g(x \mid h)\, f(h \mid h', x')\, P(X_0 = x', H_0 = h')\, dx'\, dh' = P(X_1 = x, H_1 = h).$$
This also holds for $(X_3, H_3)$ and for all subsequent $(X_t, H_t)$. This relies on the crucial step where we demonstrate that $P(X_0 = x \mid H_0 = h) = g(x \mid h)$. Once this is shown, we know that we are using the same transitions expressed in terms of $f$ and $g$ at every step.
Since the distribution of $(X_1, H_1)$ was shown above to be the same as the distribution of $(X_0, H_0)$, this forms a recursive argument that shows that all the $(X_t, H_t)$ are equal in distribution to $(X_0, H_0)$. Because $g(x \mid h)$ describes every $P(X_t = x \mid H_t = h)$, we have that all the joints $P(X_t = x, H_t = h)$ are equal in distribution to $P(X_0 = x, H_0 = h)$.
This implies that the stationary distribution $\pi_{X,H}$ is the same as the distribution of $(X_0, H_0)$. Their marginals with respect to $x$ are thus the same.
Intuitively, the proof of Theorem 3.4.1 achieves its objective by forcing all the $(X_t, H_t)$ pairs to share the same joint distribution, thus making the marginal over $X_t$ as $t \to \infty$ (i.e. the stationary distribution of the chain, $\pi_X$) be the same as $P(X_0)$, i.e., the data distribution. On the other hand, because the chain is ergodic, its stationary distribution does not depend on the initial conditions, making the model generate from an estimator of $P(X_0)$ for any initial condition.
To apply Theorem 3.4.1 in a context where we use experimental data to learn a model, we would like to have certain guarantees concerning the robustness of the stationary density $\pi_X$. When a model lacks capacity, or when it has seen only a finite number of training examples, that model can be viewed as a perturbed version of the exact quantities found in the statement of Theorem 3.4.1.
3.4.2 A note about consistency
A good overview of results from perturbation theory discussing stationary distributions in finite state Markov chains can be found in (Cho et al., 2000). We reference here only one of those results.
Theorem 3.4.2 (adapted from Schweitzer, 1968).
Let $K$ be the transition matrix of a finite state, irreducible, homogeneous Markov chain. Let $\pi$ be its stationary distribution vector, so that $K\pi = \pi$. Let $A = I - K + C$ and $Z = A^{-1}$, where $C$ is the square matrix whose columns all contain $\pi$. Then, if $\tilde{K}$ is any transition matrix (that also satisfies the irreducible and homogeneous conditions) with stationary distribution $\tilde{\pi}$, we have that
$$\|\pi - \tilde{\pi}\|_1 \leq \|Z\|_1 \, \|K - \tilde{K}\|_1,$$
where $\|\cdot\|_1$ denotes the induced matrix 1-norm when applied to matrices.
This theorem covers the case of discrete data by showing that the stationary distribution is not greatly disturbed when the transition probabilities that we learn are close to their correct values. We are talking here about the transitions between steps of the chain $\{(X_t, H_t)\}$, which are defined in Theorem 3.4.1 through the $f$ and $g$ densities.
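As a numerical sanity check of this perturbation result, the sketch below builds a small random chain, its fundamental matrix $Z = (I - K + C)^{-1}$ with the columns of $C$ all containing $\pi$, and verifies the bound in induced 1-norms, using column-stochastic matrices so that $K\pi = \pi$. The 4-state example is arbitrary, not from the text.

```python
# Numerical check of the Schweitzer-style perturbation bound on a small
# column-stochastic transition matrix K (so that K @ pi = pi). The 4-state
# random example is illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def stationary(K):
    """Stationary distribution of a column-stochastic transition matrix K."""
    w, v = np.linalg.eig(K)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])  # eigenvector for eigenvalue 1
    return pi / pi.sum()

def random_chain(n):
    """Random irreducible column-stochastic matrix (all entries positive)."""
    K = rng.uniform(0.05, 1.0, size=(n, n))
    return K / K.sum(axis=0, keepdims=True)

n = 4
K = random_chain(n)
pi = stationary(K)

# Fundamental matrix Z = (I - K + C)^{-1}, where every column of C is pi.
C = np.outer(pi, np.ones(n))
Z = np.linalg.inv(np.eye(n) - K + C)

# Perturbed chain K~ with stationary distribution pi~.
Kt = random_chain(n)
pit = stationary(Kt)

# Exact identity behind the bound: pi~ - pi = Z (K~ - K) pi~.
assert np.allclose(pit - pi, Z @ (Kt - K) @ pit)

# The bound itself: ||pi - pi~||_1 <= ||Z||_1 ||K - K~||_1
# (induced matrix 1-norms, i.e. maximum absolute column sums).
lhs = np.abs(pi - pit).sum()
rhs = np.linalg.norm(Z, 1) * np.linalg.norm(K - Kt, 1)
assert lhs <= rhs
```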
3.4.3 Training criterion for GSNs
So far we have avoided discussing the training criterion for a GSN. Various alternatives exist, but their analysis is left for future work. For now, Theorem 3.4.1 suggests the following rules:
Define $P_{\theta_2}(X \mid H)$, i.e., the decoder, to be the estimator for $P(X \mid H)$, e.g. by training an estimator of this conditional distribution from the samples $(X, H)$, with reconstruction likelihood $\log P_{\theta_2}(X \mid H)$, as this would asymptotically achieve the condition $P_{\theta_2}(X \mid H) \approx P(X \mid H)$. To see that this is true, consider the following.
We sample $X$ from $P(X)$ (the data-generating distribution) and $H$ from $P_{\theta_1}(H \mid H_0, X)$. Refer to one of the next bullet points for an explanation about how to get values for $H_0$ to be used when sampling from $P_{\theta_1}$ here. This creates a joint distribution over $(X, H)$ that has $P(X \mid H)$ as a derived conditional. Then we train the parameters of a model $P_{\theta_2}(X \mid H)$ to maximize the log-likelihood
$$\mathbb{E}_{X,H}\left[\log P_{\theta_2}(X \mid H)\right] = -\mathbb{E}_{H}\left[\mathrm{KL}\!\left(P(X \mid H)\,\middle\|\,P_{\theta_2}(X \mid H)\right)\right] + \text{const},$$
where the constant does not depend on $\theta_2$, and thus the log-likelihood is maximized when
$$P_{\theta_2}(X \mid H) = P(X \mid H).$$
Pick the transition distribution $P_{\theta_1}(H_t \mid H_{t-1}, X_{t-1})$ to be useful, i.e., train it towards the same objective, i.e., sampling an $H$ that makes it easy to reconstruct $X$. One can think of $P_{\theta_1}$ as the encoder, except that it has a state which depends on its previous value in the chain.
To approach the condition $P(H_0 \mid X_0) = P(H_1 \mid X_1)$, one interesting possibility is the following. For each $x$ in the training set, iteratively sample $H_1 \mid (H_0, x)$ and substitute the value of $H_1$ as the updated value of $H_0$. Repeat until you have achieved a kind of “burn in”. Note that, after the training is completed, when we use the chain for sampling, the samples that we get from its stationary distribution do not depend on $H_0$. Another option is to store the value of $H_1$ that was sampled for the particular training example $x$, and re-use it as the initial $H_0$ the next time that $x$ is presented during training. These techniques of substituting $H_1$ into $H_0$ are only required during training. In our experiments, we actually found that a fixed $H_0$ worked as well, so we have used this simpler approach in the reported experiments.
The rest of the chain for $t \geq 1$ is defined in terms of $(P_{\theta_1}, P_{\theta_2})$.
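As an illustration of the first rule above, the following sketch trains only the decoder on $(H, X)$ pairs produced by a fixed corruption; under a Gaussian output assumption, maximum likelihood reduces to minimizing squared reconstruction error. The toy data, the linear decoder, and all hyperparameters are assumptions for the example only.

```python
# Sketch of the first training rule: fit the decoder P_theta2(X | H) by
# maximizing reconstruction likelihood on (H, X) pairs. Here the encoder is
# fixed additive Gaussian noise and the decoder is linear with Gaussian
# output, so maximum likelihood reduces to minimizing squared error.
# All modeling choices below are assumptions made for this example.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # toy training data
H = X + 0.5 * rng.normal(size=(n, d))                  # fixed corruption H | X

V = np.zeros((d, d))                                   # decoder parameters

def loss(V):
    """Mean squared reconstruction error (Gaussian negative log-likelihood
    up to an additive constant and scale)."""
    return ((H @ V - X) ** 2).mean()

lr = 0.05
losses = [loss(V)]
for _ in range(200):
    grad = 2 * H.T @ (H @ V - X) / (n * d)             # d(loss)/dV
    V -= lr * grad                                     # gradient descent step
    losses.append(loss(V))

assert losses[-1] < losses[0]   # reconstruction likelihood improved
```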
3.5 Random variable as deterministic function of noise
There are several equivalent ways of expressing a GSN. One of the interesting formulations is to use deterministic functions of random variables to express the densities used in Theorem 3.4.1. With that approach, we define $H_{t+1} = f_{\theta_1}(X_t, H_t, Z_t)$ for some independent noise source $Z_t$, and we insist that $X_t$ cannot be recovered exactly from $H_{t+1}$, to avoid a situation in which the Markov chain would not be ergodic. The advantage of that formulation is that one can directly back-propagate the reconstruction log-likelihood $\log P_{\theta_2}(X_t \mid H_{t+1})$ into all the parameters of $f_{\theta_1}$ and $P_{\theta_2}$, using the reparametrization trick discussed above in Section 3.2.1. This method is described in (Williams, 1992).
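The following toy sketch illustrates why writing the latent as a deterministic function of noise helps: the gradient of an expected objective with respect to the parameters can then be estimated by differentiating through the samples themselves. The scalar objective and its analytic gradient are stand-ins chosen for the example only, not quantities from the text.

```python
# Sketch of the reparametrization idea: writing the latent as a deterministic
# function of noise, H = mu + sigma * Z with Z ~ N(0, 1), lets gradients of an
# expected objective flow into the parameters. The objective E[(H - c)^2] is
# chosen for this example because its gradient w.r.t. mu is known exactly.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, c = 1.5, 0.7, 0.2
Z = rng.normal(size=200_000)       # independent noise source

H = mu + sigma * Z                 # H as a deterministic function of noise
# Differentiating through the samples:
# d/dmu (H - c)^2 = 2 * (H - c) * dH/dmu, with dH/dmu = 1.
grad_mc = np.mean(2 * (H - c))     # Monte Carlo estimate of the gradient
grad_exact = 2 * (mu - c)          # analytic value for comparison

assert abs(grad_mc - grad_exact) < 0.02
```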
In the setting described at the beginning of Section 3, the function playing the role of the “encoder” was fixed for the purpose of the theorem, and we showed that learning only the “decoder” part (but a sufficiently expressive one) sufficed. In this setting we are learning both, which can lead to certain degenerate behaviors.
One problem would be if the created Markov chain failed to converge to a stationary distribution. Another such problem could be that the learned function $f_{\theta_1}$ would try to ignore the noise $Z$, or not make the best use of it. In that case, the reconstruction distribution would simply converge to a Dirac at the input $X$. This is the analogue of the constraint on auto-encoders that is needed to prevent them from learning the identity function. Here, we must design the family from which $f_{\theta_1}$ and $P_{\theta_2}$ are learned such that, when the noise $Z$ is injected, there are always several possible values of $X$ that could have been the correct original input.
Another extreme case to think about is when $H$ is overwhelmed by the noise and has lost all information about $X$. In that case the theorems are still applicable while giving uninteresting results: the learner must capture the full distribution of $X$ in $P_{\theta_2}(X \mid H)$ because the latter is now equivalent to $P(X)$, since $H$ no longer contains information about $X$. This illustrates that, when the noise is large, the reconstruction distribution (parametrized by $\theta_2$) will need to have the expressive power to represent multiple modes. Otherwise, the reconstruction will tend to capture an average output, which would visually look like a fuzzy combination of actual modes. In the experiments performed here, we have only considered unimodal reconstruction distributions (with factorized outputs), because we expect that even if $P(X \mid H)$ is not unimodal, it would be dominated by a single mode when the noise level is small. However, future work should investigate multimodal alternatives.
A related element to keep in mind is that one should pick the family of conditional distributions $P_{\theta_2}(X \mid H)$ so that one can sample from them and one can easily train them when given $(X, H)$ pairs, e.g., by maximum likelihood.
3.6 Handling missing inputs or structured output
In general, a simple way to deal with missing inputs is to clamp the observed inputs and then run the Markov chain with the constraint that the observed inputs are fixed and not resampled at each time step, whereas the unobserved inputs are resampled each time, conditioned on the clamped inputs.
In the context of the GSN described in Section 3.4 using the two distributions
$$H_{t+1} \sim P_{\theta_1}(H \mid H_t, X_t), \qquad X_{t+1} \sim P_{\theta_2}(X \mid H_{t+1}),$$
we need to make some adjustments to $P_{\theta_2}(X \mid H)$ to be able to sample $X$ conditioned on some of its components being clamped. We also focus on the case where there are no connections between the $H_t$. That is, we study the more basic situation where we train a denoising auto-encoder instead of a GSN that has connections between the hidden units.
Let $\mathcal{S}$ be a set of values that $X$ can take. For example, $\mathcal{S}$ can correspond to a subset of the units of $X$ being fixed to given values. We can talk about clamping $X \in \mathcal{S}$, or just “clamping $\mathcal{S}$” when the meaning is clear. In order to sample from a distribution with clamped $\mathcal{S}$, we need to be able to sample from
$$X_{t+1} \sim P_{\theta_2}(X \mid H_{t+1}, X \in \mathcal{S}).$$
This notation might be strange at first, but it is as legitimate as conditioning on $X \in \mathcal{S}$ when sampling from any general distribution. It involves only a renormalization of the resulting distribution $P_{\theta_2}(X \mid H_{t+1})$.
In a general scenario with two conditional distributions playing the roles of $P_{\theta_1}(H \mid X)$ and $P_{\theta_2}(X \mid H)$, i.e. the encoder and decoder, we can make certain basic assumptions so that the asymptotic distributions of $X_t$ and $H_t$ both exist. There is no reason to think that the two joints obtained by combining each conditional with the corresponding asymptotic marginal are the same, and it is trivial to construct counter-examples where they differ greatly.
However, when we train a DAE with infinite capacity, Proposition 3.1 shows that the optimal solution leads to those two joints being the same. That is, the two trained conditional distributions $P_{\theta_1}(H \mid X)$ and $P_{\theta_2}(X \mid H)$ are mutually compatible. They form a single joint distribution $\pi(X, H)$. We can sample from it by the usual Gibbs sampling procedure. Moreover, the marginal distribution over $X$ that we obtain will match that of the training data. This is the motivation for Proposition 3.1.
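Mutual compatibility can be illustrated on a small discrete example: when the encoder and decoder conditionals are both derived from one joint table, the Gibbs chain over $X$ has exactly the data marginal as its stationary distribution. The joint table below is an arbitrary example, not taken from the text.

```python
# Sketch of mutually compatible conditionals: when f(h|x) and g(x|h) are both
# derived from a single joint pi(x, h), the x-marginal of the Gibbs chain
# x -> h -> x is exactly pi(x). The small joint table is an arbitrary example.
import numpy as np

rng = np.random.default_rng(3)
joint = rng.uniform(0.1, 1.0, size=(4, 3))   # pi(x, h): 4 x-values, 3 h-values
joint /= joint.sum()

px = joint.sum(axis=1)                       # marginal pi(x)
f = (joint / px[:, None]).T                  # f[h, x] = pi(h | x), encoder
g = joint / joint.sum(axis=0)                # g[x, h] = pi(x | h), decoder

# One Gibbs sweep on x: T[x', x] = sum_h g(x'|h) f(h|x), column-stochastic.
T = g @ f

# The stationary distribution of this sweep equals the data marginal pi(x).
w, v = np.linalg.eig(T)
pi_chain = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi_chain /= pi_chain.sum()
assert np.allclose(pi_chain, px)
```

The design choice here mirrors the text: because both conditionals come from one joint, the Gibbs kernel leaves that joint invariant, which is exactly what fails when the two conditionals are trained independently and are not mutually compatible.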
Knowing that Gibbs sampling produces the desired joint distribution over $(X, H)$, we can now see how it would be possible to sample from $\pi(X \mid X \in \mathcal{S})$ if we are able to sample from $P_{\theta_1}(H \mid X)$ and $P_{\theta_2}(X \mid H, X \in \mathcal{S})$. Note that it might be very hard to sample from $P_{\theta_2}(X \mid H, X \in \mathcal{S})$, depending on the particular model used. We are not making any assumption on the factorization of $P_{\theta_2}(X \mid H)$, much like we are not making any assumption on the particular representation (or implementation) of $H$.
In section 3.4.2 we address a valid concern about the possibility that, in a practical setting, we might not train $P_{\theta_2}(X \mid H)$ to exactly match the density of $P(X \mid H)$. The model may be very close to the optimum, but it might not be able to achieve it due to its finite capacity or its particular parametrization. What does that imply about how the asymptotic distribution of the Markov chain obtained experimentally compares to the exact joint $\pi(X, H)$?
We deal with this issue in the same way as we dealt with it when it arose in the context of Theorem 3.4.1. The best that we can do is to refer to Theorem 3.4.2 and rely on an argument made in the context of discrete states, which closely approximates our situation (whether the state space is discrete or continuous).
Our Markov chain is homogeneous because its transition probabilities do not change with time. It can be made irreducible by imposing very light constraints on $P_{\theta_1}(H \mid X)$ so that $P_{\theta_1}(h \mid x) > 0$ for all $(x, h)$. This happens automatically when we take the encoder to be additive Gaussian noise (with fixed parameters) and we train only the decoder. In that case, the optimal $P_{\theta_2}(X \mid H)$ will assign non-zero probability weight to all values of $x$.
We cannot guarantee that a non-optimal $P_{\theta_2}$ will not be broken in some way, but we can often get $P_{\theta_2}(x \mid h)$ to be non-zero by selecting a parametrized model that cannot assign a probability of exactly zero to an $x$. Finally, to use Theorem 3.4.2 we need the constant involved in that theorem to be non-zero. This is a bit more complicated to enforce, but it is something that we will get if the transition matrix stays away from the identity matrix. That constant is zero when the chain is close to being degenerate.
Theorem 3.4.2 says that, with those conditions verified, an arbitrarily good $P_{\theta_2}(X \mid H)$ will lead to an arbitrarily good approximation of the exact joint $\pi(X, H)$.
Now that we know that this approach is grounded in sound theory, it is certainly reasonable to try it in experimental settings in which we are not satisfying all the requirements, and see whether the results are useful. We refer the reader to our experiment shown in Figure 6, where we clamp certain units and resample the rest.
To further understand the conditions for obtaining the appropriate conditional distributions on some of the visible inputs when others are clamped, we consider below sufficient and necessary conditions for making the stationary distribution of the clamped chain correspond to the normalized distribution (over the allowed values) of the unclamped chain.
Let $f(h \mid x)$ and $g(x \mid h)$ be the encoder and decoder functions such that they are mutually compatible (i.e. they represent a single joint distribution over $(X, H)$ for which we can sample using Gibbs sampling). Let $\pi(X, H)$ denote that joint.
Note that this happens when we minimize
$$\mathbb{E}_{X \sim P(X)}\, \mathbb{E}_{H \sim f(H \mid X)}\left[-\log g(X \mid H)\right],$$
or when we minimize the walkback loss (see Proposition 3.2.2).
Let $\mathcal{S}$ be a set of values that $X$ can take (e.g. some components of $X$ can be assigned certain fixed values), and such that $\pi(X \in \mathcal{S}) > 0$. Let $\pi(x \mid X \in \mathcal{S})$ denote the conditional distribution of $\pi(X, H)$ in which we marginalize over $H$ and condition on $X \in \mathcal{S}$. That is,
$$\pi(x \mid X \in \mathcal{S}) = \frac{\pi(x)\, \mathbb{1}_{x \in \mathcal{S}}}{\int_{x' \in \mathcal{S}} \pi(x')\, dx'}, \qquad \text{where } \pi(x) = \int \pi(x, h)\, dh.$$
Let $g_{\mathcal{S}}(x \mid h)$ denote a restriction of the decoder function $g$ that puts probability weight only on the values $x \in \mathcal{S}$. That is,
$$g_{\mathcal{S}}(x \mid h) = \frac{g(x \mid h)\, \mathbb{1}_{x \in \mathcal{S}}}{\int_{x' \in \mathcal{S}} g(x' \mid h)\, dx'}.$$